How does one write a ClusterSpec for distributed YouTube-8M Challenge training?

Can someone please post a ClusterSpec for distributed training of the models defined in the YouTube-8m Challenge code?
The code tries to load a cluster spec from the TF_CONFIG environment variable; however, I'm not sure what the value of TF_CONFIG should be. I have access to 2 GPUs on one machine and just want to run the model with data-level parallelism.

If you want to run the YouTube-8M challenge code in a distributed manner, you have to write a YAML file (Google provides an example YAML file) and then pass the location of that YAML file as a parameter. TF_CONFIG refers to the configuration variables used to train the model.
For example, to run the starter code on Google Cloud in a distributed manner, I have used:
JOB_NAME=yt8m_train_$(date +%Y%m%d_%H%M%S); gcloud --verbosity=debug ml-engine jobs \
submit training $JOB_NAME \
--package-path=youtube-8m --module-name=youtube-8m.train \
--staging-bucket=$BUCKET_NAME --region=us-east1 \
--config=youtube-8m/cloudml-gpu-distributed.yaml \
-- --train_data_pattern='gs://youtube8m-ml-us-east1/1/frame_level/train/train*.tfrecord' \
--frame_features=True --model=LstmModel --feature_names="rgb,audio" \
--feature_sizes="1024, 128" --batch_size=128 \
--train_dir=$BUCKET_NAME/${JOB_TO_EVAL}
The config parameter points to the YAML file cloudml-gpu-distributed.yaml, which has the following specification:
trainingInput:
  runtimeVersion: "1.0"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 2
  workerType: standard_gpu
  parameterServerCount: 2
  parameterServerType: standard
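Cloud ML Engine derives TF_CONFIG from this YAML and sets it on every replica automatically, so for cloud jobs you never write the value yourself. If you only want data-level parallelism on two local GPUs, a rough sketch (my own illustration, not part of the original answer; host names, ports, and the task entry are placeholders) of what you would export before launching each training process is:
export TF_CONFIG='{
  "cluster": {
    "master": ["localhost:2222"],
    "worker": ["localhost:2223"],
    "ps": ["localhost:2224"]
  },
  "task": {"type": "worker", "index": 0}
}'
# Start one process per task entry (master, worker, ps), each with its own
# "task" value; pin the GPU processes with e.g. CUDA_VISIBLE_DEVICES=0 or 1.
The exact keys read from TF_CONFIG depend on how train.py parses it, so check the code before relying on this layout.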

Related

Google Endpoints YAML file update: Is there a simpler method

When using Google Endpoints with Cloud Run to provide the container service, one creates a YAML file (Swagger 2.0 format) to specify the paths and all configurations. For every change, the following is what I do (based on the documentation at https://cloud.google.com/endpoints/docs/openapi/get-started-cloud-functions):
Step 1: Deploying the Endpoints configuration
gcloud endpoints services deploy openapi-functions.yaml \
--project ESP_PROJECT_ID
This gives me the following output:
Service Configuration [CONFIG_ID] uploaded for service [CLOUD_RUN_HOSTNAME]
Then,
Step 2: Download the script to the local machine and run it
chmod +x gcloud_build_image
./gcloud_build_image -s CLOUD_RUN_HOSTNAME \
-c CONFIG_ID -p ESP_PROJECT_ID
Then,
Step 3: Redeploy the Cloud Run service
gcloud run deploy CLOUD_RUN_SERVICE_NAME \
--image="gcr.io/ESP_PROJECT_ID/endpoints-runtime-serverless:CLOUD_RUN_HOSTNAME-CONFIG_ID" \
--allow-unauthenticated \
--platform managed \
--project=ESP_PROJECT_ID
Is this the process for every API path change? Or is there a simpler direct method of updating the YAML file and uploading it somewhere?
Thanks.
Based on the documentation, yes, this would be the process for every API path change. However, this may change in the future, as this feature is currently in beta, as stated in the documentation you shared.
You may want to look here to create a feature request for GCP so they can improve this feature in the future.
In the meantime, I would advise creating a script for this process, since it is always the same steps; a small bash script that runs these commands (sketched below) would help you automate the task.
Hope you find this useful.
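For reference, a rough sketch of such a script (it reuses the placeholder names from the question and parses the CONFIG_ID out of the step 1 output, so treat it as a starting point rather than a tested recipe):
#!/bin/bash
set -e
PROJECT=ESP_PROJECT_ID
HOSTNAME=CLOUD_RUN_HOSTNAME
SERVICE_NAME=CLOUD_RUN_SERVICE_NAME
# Step 1: deploy the Endpoints configuration and capture the CONFIG_ID from
# the "Service Configuration [CONFIG_ID] uploaded ..." message (brittle if
# the wording of that message ever changes).
CONFIG_ID=$(gcloud endpoints services deploy openapi-functions.yaml \
  --project "$PROJECT" 2>&1 | sed -n 's/.*Service Configuration \[\(.*\)\] uploaded.*/\1/p')
# Step 2: rebuild the ESP serverless image for that configuration
./gcloud_build_image -s "$HOSTNAME" -c "$CONFIG_ID" -p "$PROJECT"
# Step 3: redeploy the Cloud Run service with the new image
gcloud run deploy "$SERVICE_NAME" \
  --image="gcr.io/$PROJECT/endpoints-runtime-serverless:$HOSTNAME-$CONFIG_ID" \
  --allow-unauthenticated \
  --platform managed \
  --project="$PROJECT"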
When you use the default Cloud Endpoints image as described in the documentation, the --rollout_strategy=managed parameter is automatically set.
You have to wait up to 1 minute for the new configuration to take effect; at least that is what I observe in my deployments. Give it a try!

Google Dataflow creates only one worker for large .bz2 file

I am trying to process the Wikidata json dump using Cloud Dataflow.
I have downloaded the file from https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2 and uploaded it to a GCS bucket. It's a large (50 GB) .bz2 file containing a list of JSON dicts (one per line).
I understand that apache_beam.io.ReadFromText can handle .bz2 (I tested that on toy datasets) and that .bz2 is splittable. Therefore I was hoping that multiple workers would be created and would work in parallel on different blocks of that single file (I'm not totally clear if/how the blocks would be handled in that case).
Ultimately I want to do some analytics on each line (each json dict) but as a test for ingestion I am just using the project's wordcount.py:
python -m apache_beam.examples.wordcount \
--input gs://MYBUCKET/wikidata/latest-all.json.bz2 \
--output gs://MYBUCKET/wikidata/output/entities-all.json \
--runner DataflowRunner \
--project MYPROJECT \
--temp_location gs://MYBUCKET/tmp/
At startup, autoscaling quickly increases the number of workers from 1 to 6, but only one worker does any work, and autoscaling then scales back from 6 to 1 after a couple of minutes (job id: 2018-10-11_00_45_54-9419516948329946918).
If I disable autoscaling and set the number of workers explicitly, then all but one remain idle.
Can parallelism be achieved on this sort of input? Many thanks for any help.
Unlike Hadoop, Apache Beam has not yet implemented bzip2 splitting: https://issues.apache.org/jira/browse/BEAM-683
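One workaround (my own sketch, not part of the answer above; bucket paths and the shard size are placeholders) is to re-shard the dump into many smaller .bz2 files before running the pipeline, since ReadFromText parallelizes across multiple input files even when each individual file is unsplittable:
# Split the single dump into many compressed shards (needs enough local disk
# for the decompressed lines).
gsutil cp gs://MYBUCKET/wikidata/latest-all.json.bz2 .
bzcat latest-all.json.bz2 | split -l 500000 - shard-
for f in shard-*; do bzip2 "$f"; done
gsutil -m cp shard-*.bz2 gs://MYBUCKET/wikidata/split/
# Then point the pipeline at the shards instead of the single file:
#   --input gs://MYBUCKET/wikidata/split/shard-*.bz2
With one shard per bundle, Dataflow can keep multiple workers busy.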

Error while running model training in Google Cloud ML

I want to run model training in the cloud. I am following this link, which runs sample code to train a model on the flowers dataset. The tutorial consists of 4 stages:
Set up your Cloud Storage bucket
Preprocessing training and evaluation data in the cloud
Run model training in the cloud
Deploying and using the model for prediction
I was able to complete steps 1 and 2; however, in step 3 the job is submitted successfully, but an error occurs and the task exits with a non-zero status of 1 (the task log and expanded log screenshots are not reproduced here).
I used the following command:
gcloud ml-engine jobs submit training test${JOB_ID} \
--stream-logs \
--module-name trainer.task \
--package-path trainer \
--staging-bucket ${BUCKET_NAME} \
--region us-central1 \
--runtime-version=1.2 \
-- \
--output_path "${GCS_PATH}/training" \
--eval_data_paths "${GCS_PATH}/preproc/eval*" \
--train_data_paths "${GCS_PATH}/preproc/train*"
Thanks in advance!
Can you please confirm that the input files (eval_data_paths and train_data_paths) are not empty? Additionally, if you are still having issues, can you please file an issue at https://github.com/GoogleCloudPlatform/cloudml-samples, since it's easier to handle the issue on GitHub.
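A quick way to check this from the shell (using the same variables as in the submit command):
# List the preprocessed inputs and their sizes; missing or zero-length files
# here would explain the failure.
gsutil ls -l "${GCS_PATH}/preproc/eval*"
gsutil ls -l "${GCS_PATH}/preproc/train*"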
I met the same issue and couldn't figure it out; then I followed this tutorial again, starting over from git clone, and there was no error when running on GCS.
It is clear from your error message
The replica worker 1 exited with a non-zero status of 1. Termination reason: Error
that you have some programming error (syntax error, undefined variable, etc.).
For more information, check the return code and its meaning:
Return code    Meaning                  Cloud ML Engine response
0              Successful completion    Shuts down and releases job resources.
1-128          Unrecoverable error      Ends the job and logs the error.
You need to find your bug first and fix it, then try again.
I recommend running your task locally (if your configuration supports it) before you submit it to the cloud. If you find any bug, you can fix it easily on your local machine.
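For example, a local dry run of the same trainer can be done with gcloud's local mode (a sketch reusing the flags from the question; the output path is a placeholder):
gcloud ml-engine local train \
  --module-name trainer.task \
  --package-path trainer \
  -- \
  --output_path "/tmp/flowers/training" \
  --eval_data_paths "${GCS_PATH}/preproc/eval*" \
  --train_data_paths "${GCS_PATH}/preproc/train*"
Most syntax and import errors show up immediately this way, without waiting for a cloud job to fail.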

Trainer module not found in Google Cloud ML Engine

I am trying to tune my variational autoencoder's hyperparameters using Google Cloud ML Engine. I set up my package with the structure they recommend in the docs, so that I specify "trainer.task" as my main module name. (An image of my directory structure was attached but is not reproduced here.)
This works on my own machine when I include the following lines:
import sys
sys.path.append("/path/to/project/directory/")
When I run using the below command, I get the error "No module named trainer". Is there a different path I need to specify or something special I need to do for running on Google Cloud ML Engine?
gcloud ml-engine jobs submit training $JOB_NAME --package-path $TRAINER_PACKAGE_PATH --module-name $MAIN_TRAINER_MODULE --job-dir $JOB_DIR --region $REGION --config config.yaml
Do you have a setup.py file? If so, you might be hitting this issue.
To debug this:
Get the GCS location of the package from the job
gcloud --project=$PROJECT ml-engine jobs describe $JOB_NAME
This will output something like
jobId: somejob
state: PREPARING
trainingInput:
jobDir: gs://BUCKET/job
packageUris:
- gs://bucket/job/packages/7d2611c7366f266058da5a9e2c93467426c5fdd018491fa33853516d9db533b1/somepackage-0.0.0.tar.gz
pythonModule: cifar.task
region: us-central1
trainingOutput: {}
Note the values above are for illustrative purposes only and will differ from your output.
Copy the GCS package to your machine
gsutil cp gs://bucket/job/packages/7d2611c7366f266058da5a9e2c93467426c5fdd018491fa33853516d9db533b1/somepackage-0.0.0.tar.gz /tmp
Unpack the .tar.gz and check that it has a trainer directory containing an __init__.py file and task.py, for example as shown below. If not, then you probably specified incorrect values on the command line.
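For example (the package name is illustrative and will match whatever was uploaded for your job):
# Inspect the staged package without unpacking it:
tar -tzf /tmp/somepackage-0.0.0.tar.gz
# Expected entries look something like:
#   somepackage-0.0.0/setup.py
#   somepackage-0.0.0/trainer/__init__.py
#   somepackage-0.0.0/trainer/task.py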
If you include the actual command line (i.e. the values for the variables) and the contents of .tar.gz, I can probably provide a better answer.
Jeremy, I had a similar problem. I downloaded and unpacked my files, but there was no task.py in them.
These are the cmd line arguments I used:
gcloud ml-engine jobs submit training job11 \
  --package-path=./trainer \
  --module-name='Keras_On_GoogleCloud.trainer.shallownet_train' \
  --job-dir=gs://zubair-gc-bucket/jobs/job11 \
  --region='us-central1' \
  --config=trainer/cloudml-gpu.yaml \
  -- \
  --job_name='zubair-gc-job11' --dataset='dataset/animals' --model='shallownet_weights1.hdf5'

gcloud ml-engine predict is very slow on inference

I'm testing a segmentation model on gcloud and the inference is incredibly slow. It takes 3 min to get the result (averaged over 5 runs). The same model runs in ~2.5 s on my laptop when served through tf-serving.
Is it normal? I didn't find any mention in the documentation on how to define the instance type and it seems impossible to run inference on GPU.
The steps I'm using are fairly straightforward and follow the examples and tutorials:
gcloud ml-engine models create "seg_model"
gcloud ml-engine versions create v1 \
--model "seg_model" \
--origin $DEPLOYMENT_SOURCE \
--runtime-version 1.2 \
--staging-bucket gs://$BUCKET_NAME
gcloud ml-engine predict --model ${MODEL_NAME} --version v1 --json-instances request.json
Update: after running more experiments, I found that redirecting the output to a file gets the inference time down to 27 s. The model output size is 512x512, which probably causes some delay on the client side. Although this is much better than 3 min, it is still an order of magnitude slower than tf-serving.
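For reference, the redirect is just the same predict call with its stdout sent to a file (same model and version as above):
# Avoid printing the large 512x512 response to the terminal.
gcloud ml-engine predict --model ${MODEL_NAME} --version v1 --json-instances request.json > response.json
This only changes where the response is written, which matches the observation that terminal output dominated the original 3 min figure.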
