Error while running model training in google cloud ml - machine-learning

I want to run model training in the cloud. I am following this tutorial, which runs sample code to train a model based on the flower dataset. The tutorial consists of 4 stages:
Set up your Cloud Storage bucket
Preprocessing training and evaluation data in the cloud
Run model training in the cloud
Deploying and using the model for prediction
I was able to complete steps 1 and 2. In step 3, however, the job is submitted successfully, but then an error occurs and the task exits with a non-zero status of 1. Here is the log of the task:
Screenshot of the expanded log:
I used the following command:
gcloud ml-engine jobs submit training test${JOB_ID} \
--stream-logs \
--module-name trainer.task \
--package-path trainer \
--staging-bucket ${BUCKET_NAME} \
--region us-central1 \
--runtime-version=1.2 \
-- \
--output_path "${GCS_PATH}/training" \
--eval_data_paths "${GCS_PATH}/preproc/eval*" \
--train_data_paths "${GCS_PATH}/preproc/train*"
Thanks in advance!

Can you please confirm that the input files (eval_data_paths and train_data_paths) are not empty? Additionally, if you are still having issues, can you please file an issue at https://github.com/GoogleCloudPlatform/cloudml-samples, since it's easier to handle the issue on GitHub.

I hit the same issue and couldn't figure it out. I then followed this and redid everything starting from git clone, and there was no error when running on GCS.

It is clear from your error message
The replica worker 1 exited with a non-zero status of 1. Termination reason: Error
that you have some programming error (a syntax error, an undefined name, etc.).
For more information, check the return codes and their meanings:
Return code    Meaning                  Cloud ML Engine response
0              Successful completion    Shuts down and releases job resources.
1-128          Unrecoverable error      Ends the job and logs the error.
You need to find and fix your bug first, then try again.
I recommend running your task locally (if your configuration supports it) before you submit it to the cloud. If you find a bug, you can fix it easily on your local machine.
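The return-code table above can be applied directly when you run the trainer locally. The following is a minimal Python sketch, not part of the original answer: run_locally and describe_exit_code are hypothetical helper names, and the command you pass in (for example, one that invokes your trainer.task module with local paths) is up to you.

```python
import subprocess
import sys


def describe_exit_code(code: int) -> str:
    """Map a process return code to the meaning quoted in the table."""
    if code == 0:
        return "Successful completion"
    if 1 <= code <= 128:
        return "Unrecoverable error"
    # The quoted table does not cover other values.
    return "Not covered by the quoted table"


def run_locally(command: list) -> int:
    """Run a command on the local machine and return its exit code."""
    completed = subprocess.run(command)
    return completed.returncode


if __name__ == "__main__":
    # Stand-in for a failing trainer: a child Python process that exits with 1.
    code = run_locally([sys.executable, "-c", "import sys; sys.exit(1)"])
    print(code, describe_exit_code(code))
```

Running the real trainer this way surfaces the Python traceback in your terminal, which is usually faster to read than the cloud job logs.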

Related

Google Endpoints YAML file update: Is there a simpler method

When using Google Endpoints with Cloud Run to provide the container service, one creates a YAML file (Swagger 2.0 format) to specify the paths with all configurations. For EVERY CHANGE, this is what I do, based on the documentation (https://cloud.google.com/endpoints/docs/openapi/get-started-cloud-functions):
Step 1: Deploying the Endpoints configuration
gcloud endpoints services deploy openapi-functions.yaml \
--project ESP_PROJECT_ID
This gives me the following output:
Service Configuration [CONFIG_ID] uploaded for service [CLOUD_RUN_HOSTNAME]
Then,
Step 2: Download the script to local machine
chmod +x gcloud_build_image
./gcloud_build_image -s CLOUD_RUN_HOSTNAME \
-c CONFIG_ID -p ESP_PROJECT_ID
Then,
Step 3: Redeploy the Cloud Run service
gcloud run deploy CLOUD_RUN_SERVICE_NAME \
--image="gcr.io/ESP_PROJECT_ID/endpoints-runtime-serverless:CLOUD_RUN_HOSTNAME-CONFIG_ID" \
--allow-unauthenticated \
--platform managed \
--project=ESP_PROJECT_ID
Is this the process for every API path change? Or is there a simpler direct method of updating the YAML file and uploading it somewhere?
Thanks.
Based on the documentation, yes, this would be the process for every API path change. However, this may change in the future, as the feature is currently in beta, as stated in the documentation you shared.
You may want to look over here in order to file a feature request with GCP so they can improve this feature in the future.
In the meantime, I would advise writing a script for this process, since the steps are always the same; a bash script that runs these commands would help you automate the task.
Hope you find this useful.
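The answer suggests bash; the same idea is sketched here in Python. This is only an illustration under stated assumptions: the function names are hypothetical, the gcloud_build_image script is assumed to be already downloaded into the current directory, and the CONFIG_ID is extracted from the "Service Configuration [...] uploaded for service [...]" line quoted in the question. Because gcloud may print that line to either stream, the sketch searches both stdout and stderr.

```python
import re
import subprocess

# Matches the status line quoted in the question.
UPLOADED = re.compile(
    r"Service Configuration \[(?P<config_id>[^\]]+)\] "
    r"uploaded for service \[(?P<host>[^\]]+)\]"
)


def parse_deploy_output(output: str):
    """Return (config_id, hostname) found in the deploy command's output."""
    match = UPLOADED.search(output)
    if match is None:
        raise ValueError("could not find the 'Service Configuration' line")
    return match["config_id"], match["host"]


def build_image_command(config_id: str, host: str, project: str) -> list:
    """Step 2 from the question, as an argument list."""
    return ["./gcloud_build_image", "-s", host, "-c", config_id, "-p", project]


def run_deploy_command(service: str, config_id: str, host: str, project: str) -> list:
    """Step 3 from the question, as an argument list."""
    image = f"gcr.io/{project}/endpoints-runtime-serverless:{host}-{config_id}"
    return [
        "gcloud", "run", "deploy", service,
        f"--image={image}",
        "--allow-unauthenticated",
        "--platform", "managed",
        f"--project={project}",
    ]


def redeploy(yaml_path: str, service: str, project: str) -> None:
    """Run steps 1 to 3 in order, stopping at the first failure."""
    step1 = subprocess.run(
        ["gcloud", "endpoints", "services", "deploy", yaml_path, "--project", project],
        check=True, capture_output=True, text=True,
    )
    config_id, host = parse_deploy_output(step1.stdout + step1.stderr)
    subprocess.run(build_image_command(config_id, host, project), check=True)
    subprocess.run(run_deploy_command(service, config_id, host, project), check=True)
```

A call such as redeploy("openapi-functions.yaml", "CLOUD_RUN_SERVICE_NAME", "ESP_PROJECT_ID") would then replace the three manual steps, with the placeholders filled in for your project.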
When you use the default Cloud Endpoints image as described in the documentation, the parameter --rollout_strategy=managed is set automatically.
You then have to wait up to 1 minute before the new configuration is in use. That is what I personally observe in my deployments. Give it a try!

How to Activate Dataflow Shuffle Service through gcloud CLI

I am trying to activate the Dataflow Shuffle [DS] through gcloud command line interface.
I am using this command:
gcloud dataflow jobs run ${JOB_NAME_STANDARD} \
--project=${PROJECT_ID} \
--region=us-east1 \
--service-account-email=${SERVICE_ACCOUNT} \
--gcs-location=${TEMPLATE_PATH}/template \
--staging-location=${PIPELINE_FOLDER}/staging \
--parameters "experiments=[shuffle_mode=\"service\"]"
The job starts. The Dataflow UI reflects it:
However, the logs show an error parsing the value:
Failed to parse SDK pipeline options: json: cannot unmarshal string into Go struct
field sdkPipelineOptions.experiments of type []string
What am I doing wrong?
This question is indeed related to an existing question:
How to activate Dataflow Shuffle service?
however the original question was covering python API, while my problem is with gcloud CLI.
[DS] https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#cloud-dataflow-shuffle
P.S. Update
I have also tried:
No luck.
There's currently no way (that I know of) to enable shuffle_service for a template.
You have two options:
a) Run a job not from template
b) create a template that already has shuffle_service enabled.
The unmarshalling issue is most likely because templates only support a fixed set of parameters, and the template does not support the "experiments" parameter.
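The error text itself says that a JSON string arrived where a list of strings ([]string) was expected. The Python sketch below only illustrates that type mismatch; the option dictionaries are made-up examples, experiments_is_string_list is a hypothetical helper, and nothing here is a way to pass a list through --parameters.

```python
import json


def experiments_is_string_list(options_json: str) -> bool:
    """Return True if 'experiments' decodes to a list of strings."""
    options = json.loads(options_json)
    experiments = options.get("experiments")
    return isinstance(experiments, list) and all(
        isinstance(item, str) for item in experiments
    )


# A value passed as one string: the shape the error message complains about.
as_string = json.dumps({"experiments": '[shuffle_mode="service"]'})

# The shape the service expects: a list with one string per experiment.
as_list = json.dumps({"experiments": ["shuffle_mode=service"]})

if __name__ == "__main__":
    print(experiments_is_string_list(as_string))
    print(experiments_is_string_list(as_list))
```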
----UPD----
I was asked on how to create template with shuffle_service enabled.
Here are sample steps I took.
Followed the WordCount tutorial to create a project with a pipeline definition.
Created the template with the following command:
mvn -Pdataflow-runner compile exec:java -Dexec.mainClass=org.apache.beam.examples.WindowedWordCount -Dexec.args="--project={project-name} --stagingLocation=gs://{staging-location} --inputFile=gs://apache-beam-samples/shakespeare/* --output=gs://{output-location} --runner=DataflowRunner --experiments=shuffle_mode=service --region=us-central1 --templateLocation=gs://{resulting-template-location}"
Note the --experiments=shuffle_mode=service argument.
Invoked the template from the UI or via this command:
gcloud dataflow jobs run {job-name} --project={project-name} --region=us-central1 --gcs-location=gs://{resulting-template-location}

Getting DataflowRunner with --experiments=upload_graph to work

I have a pipeline that produces a Dataflow graph (serialized JSON representation) that exceeds the allowable limit for the API, and thus cannot be launched via the Dataflow runner for Apache Beam as one normally would. Running the Dataflow runner with the suggested parameter --experiments=upload_graph does not work either: it fails saying there are no steps specified.
When getting notified about this size problem via an error, the following information is provided:
the size of the serialized JSON representation of the pipeline exceeds the allowable limit for the API.
Use experiment 'upload_graph' (--experiments=upload_graph)
to direct the runner to upload the JSON to your
GCS staging bucket instead of embedding in the API request.
Using this parameter does indeed result in the Dataflow runner uploading an additional dataflow_graph.pb file to the staging location beside the usual pipeline.pb file. I verified that it actually exists in GCP storage.
However the job in gcp dataflow then immediately fails after start with the following error:
Runnable workflow has no steps specified.
I've tried this flag with various pipelines, even apache beam example pipelines and see the same behaviour.
This can be reproduced by using word count example:
mvn archetype:generate \
-DarchetypeGroupId=org.apache.beam \
-DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
-DarchetypeVersion=2.11.0 \
-DgroupId=org.example \
-DartifactId=word-count-beam \
-Dversion="0.1" \
-Dpackage=org.apache.beam.examples \
-DinteractiveMode=false
cd word-count-beam/
Running it without the experiments=upload_graph parameter works:
(make sure to specify your project, and buckets if you want to run this)
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
-Dexec.args="--runner=DataflowRunner --project=<your-gcp-project> \
--gcpTempLocation=gs://<your-gcs-bucket>/tmp \
--inputFile=gs://apache-beam-samples/shakespeare/* --output=gs://<your-gcs-bucket>/counts" \
-Pdataflow-runner
Running it with experiments=upload_graph results in the pipeline failing with the message "workflow has no steps specified":
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
-Dexec.args="--runner=DataflowRunner --project=<your-gcp-project> \
--gcpTempLocation=gs://<your-gcs-bucket>/tmp \
--experiments=upload_graph \
--inputFile=gs://apache-beam-samples/shakespeare/* --output=gs://<your-gcs-bucket>/counts" \
-Pdataflow-runner
Now I would expect that dataflow runner would direct gcp dataflow to read the steps from the bucket specified as seen in the source code:
https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java#L881
However this seems not to be the case. Has anyone gotten this to work, or has found some documentation regarding this feature that can point me in the right direction?
The experiment has since been reverted, and the messaging will be corrected in Beam 2.13.0.
Revert PR
I recently ran into this issue and the solution was quite silly. I had a fairly complex Dataflow streaming job that was working fine, and the next day it stopped working with the error "Runnable workflow has no steps specified.". In my case, someone had called pipeline().run().waitUntilFinish() twice after creating the options, and that caused the error. Removing the duplicate pipeline run resolved the issue. I still think Beam/DataflowRunner should produce a more useful error trace in this scenario.

Running Google Cloud ML training job but getting no stdout output in logs

I've built a trainer, and when I submit the job, it starts and the logs get populated. But none of my output to stdout ever appears in the log. I do get messages like "The TensorFlow library wasn't compiled to use AVX2 instructions..."
The entire job takes about 5 to 10 minutes on my laptop; I let it run for over an hour on the cloud server and still never saw any output (and the first line of output occurs almost immediately when I run it locally.)
I can run my job locally by invoking it directly, but I haven't been able to get it to run using the "gcloud local" command... when I do this, I get an error "No module named tensorflow"
The log message "The TensorFlow library wasn't compiled to use AVX2 instructions" indicates that log messages are flowing from TensorFlow to Cloud Logging. So most likely there is a problem with the way you have configured logging and as a result log messages aren't being correctly written to stderr/stdout.
The easiest way to debug this would be to create a simple example that tries to reproduce the error.
I'd suggest creating a simple Python program that does nothing but log a message, and then submitting that to the service to see whether a log message is printed.
Something like the following:
import logging
import time

if __name__ == "__main__":
    logging.getLogger().setLevel(logging.INFO)
    # Output logs for 5 minutes. We do this for 5 minutes just to ensure
    # the job doesn't terminate before logs can be flushed.
    for i in range(30):
        logging.info("This is an info message.")
        logging.error("This is an error message.")
        time.sleep(10)
For the issue importing TensorFlow when running locally, please take a look at this SO question, which has some suggestions on how to check the Python path used by gcloud and verify that it includes TensorFlow.
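One way to do that check is sketched below in Python. It is an illustration rather than the linked question's method: module_available is a hypothetical helper, and the script only reports on whichever interpreter runs it, so it needs to be run with the same interpreter that gcloud uses.

```python
import importlib.util
import sys


def module_available(name: str) -> bool:
    """Return True if a top-level module can be found on this interpreter's path."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Show which interpreter is running and where it looks for modules.
    print("Interpreter:", sys.executable)
    print("TensorFlow importable:", module_available("tensorflow"))
    for entry in sys.path:
        print("  ", entry)
```

If the script reports that TensorFlow is not importable, the "No module named tensorflow" error comes from that interpreter's environment rather than from the trainer code.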

Running Google Dataflow locally for Image Recognition

I am currently following this tutorial for transfer learning using tensorflow and Google Cloud Platform.
https://cloud.google.com/blog/big-data/2016/12/how-to-train-and-classify-images-using-google-cloud-machine-learning-and-cloud-dataflow
It works perfectly on the cloud with my own data when I use their sample code
# Preprocess the eval set.
python trainer/preprocess.py \
--input_dict "$DICT_FILE" \
--input_path "gs://cloud-ml-data/img/flower_photos/eval_set.csv" \
--output_path "${GCS_PATH}/preproc/eval" \
--cloud
I get all the preprocessing, training and deployment done.
However, I would like to be able to run it locally so that I can make changes in the code and debug it more efficiently.
In the code it states that
To run this pipeline locally run the above command without --cloud.
So it would read:
# Preprocess the eval set.
python trainer/preprocess.py \
--input_dict "$DICT_FILE" \
--input_path "gs://cloud-ml-data/img/flower_photos/eval_set.csv" \
--output_path "${GCS_PATH}/preproc/eval"
I tried to run this code with input_dict, input_path and output_path set to a cloud storage path, as well as being a path to a file on my local machine.
However I get the error:
tensorflow/core/platform/cloud/google_auth_provider.cc:151] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "Unavailable: libcurl failed with error code 23: Failed writing body (91 != 196)". Retrieving token from GCE failed with "Unavailable: Unexpected response code 0".
So it seems to be an authentication issue.
However, I do not have authentication problems when copying files from Google Cloud Storage manually.
I already tried:
$ gcloud auth application-default login
but it doesn't change anything.
Does anyone have a solution for that?
