Error: "argument --job-dir: expected one argument" while training model using AI Platform on GCP - machine-learning

Running macOS Mojave.
I am following the official getting started documentation to run a model using AI Platform.
So far I have managed to train my model locally using:
# This is similar to `python -m trainer.task --job-dir local-training-output`
# but it better replicates the AI Platform environment, especially
# for distributed training (not applicable here).
gcloud ai-platform local train \
--package-path trainer \
--module-name trainer.task \
--job-dir local-training-output
I then proceed to train the model using AI Platform by going through the following steps:
Setting the environment variables:
export JOB_NAME="my_first_keras_job"
export JOB_DIR="gs://$BUCKET_NAME/keras-job-dir"
Running the following command, as indicated in the docs, to package the trainer/ directory and submit the job:
gcloud ai-platform jobs submit training $JOB_NAME \
--package-path trainer/ \
--module-name trainer.task \
--region $REGION \
--python-version 3.5 \
--runtime-version 1.13 \
--job-dir $JOB_DIR \
--stream-logs
I get the error:
ERROR: (gcloud.ai-platform.jobs.submit.training) argument --job-dir: expected one argument
Usage: gcloud ai-platform jobs submit training JOB [optional flags] [-- USER_ARGS ...]
optional flags may be --async | --config | --help | --job-dir | --labels | ...
As far as I understand, --job-dir does indeed have one argument.
I am not sure what I'm doing wrong. I am running the above command from the trainer/ directory, as shown in the documentation. I tried removing all spaces, as described here, but the error persists.

Are you running this command locally, or on an AI notebook VM in Jupyter? Based on your details I assume you're running it locally. I am not a Mac expert, but hopefully this is helpful.
I just worked through the same error on an AI notebook VM, and my issue was that even though I had assigned it a value in a previous Jupyter cell, the $JOB_NAME variable was passing along an empty string in the gcloud command. Try running the following to make sure your code is actually passing a value for $JOB_DIR when you make the gcloud ai-platform call:
echo $JOB_DIR
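That empty-variable failure also explains the exact message: if $JOB_DIR is unset or empty, the unquoted expansion disappears entirely, so gcloud sees --job-dir immediately followed by the next flag and reports that it expected one argument. A minimal guard, as a sketch using the variable names from the question:
# Abort with a clear message if a required variable is unset or empty,
# instead of letting gcloud fail later with a cryptic parse error.
: "${JOB_NAME:?JOB_NAME is not set}"
: "${JOB_DIR:?JOB_DIR is not set}"
echo "Submitting $JOB_NAME with --job-dir $JOB_DIR"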

Related

Avoiding duplicated arguments when running a Docker container

I have a TensorFlow training script which I want to run using a Docker container (based on the official TF GPU image). Although everything works just fine, running the container with the script is horribly verbose and ugly. The main problem is that my training script allows the user to specify various directories used during training, for input data, logging, generating output, etc. I don't want to change what my users are used to, so the container needs to be informed of the location of these user-defined directories, so it can mount them. So I end up with something like this:
docker run \
-it --rm --gpus all -d \
--mount type=bind,source=/home/guest/datasets/my-dataset,target=/datasets/my-dataset \
--mount type=bind,source=/home/guest/my-scripts/config.json,target=/config.json \
-v /home/guest/my-scripts/logdir:/logdir \
-v /home/guest/my-scripts/generated:/generated \
train-image \
python train.py \
--data_dir /datasets/my-dataset \
--gpu 0 \
--logdir ./logdir \
--output ./generated \
--config_file ./config.json \
--num_epochs 250 \
--batch_size 128 \
--checkpoint_every 5 \
--generate True \
--resume False
In the above I am mounting a dataset from the host into the container, and also mounting a single config file config.json (which configures the TF model). I specify a logging directory logdir and an output directory generated as volumes. Each of these resources is also passed as a parameter to the train.py script.
This is all very ugly, but I can't see another way of doing it. Of course I could put all this in a shell script and provide command-line arguments which set these duplicated values from the outside. But this doesn't seem like a nice solution, because if I want to do anything else with the container, for example check the logs, I would have to use the raw docker command.
I suspect this question will likely be tagged as opinion-based, but I've not found a good solution for this that I can recommend to my users.
As user Ron van der Heijden points out, one solution is to use docker-compose in combination with environment variables defined in an .env file. Nice answer.
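As a sketch of that approach (hypothetical files; the variable names are mine, and GPU flags and the remaining training options are omitted for brevity), the host paths move into .env:
DATA_DIR=/home/guest/datasets/my-dataset
LOG_DIR=/home/guest/my-scripts/logdir
OUT_DIR=/home/guest/my-scripts/generated
CONFIG_FILE=/home/guest/my-scripts/config.json
and docker-compose.yml reuses them, so each path is written once and users only ever type docker compose up (with docker compose logs covering the log-checking case):
services:
  train:
    image: train-image
    volumes:
      - ${DATA_DIR}:/datasets/my-dataset
      - ${LOG_DIR}:/logdir
      - ${OUT_DIR}:/generated
      - ${CONFIG_FILE}:/config.json
    command: python train.py --data_dir /datasets/my-dataset --logdir /logdir --output /generated --config_file /config.json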

gcloud run deploy keeps erroring when I add args: '--args: expected one argument'

I am trying to run gcloud run deploy with the following parameters:
gcloud run deploy "$SERVICE_NAME" \
--quiet \
--region "$RUN_REGION" \
--image "gcr.io/$PROJECT_ID/$SERVICE_NAME:$GITHUB_SHA" \
--platform "managed" \
--allow-unauthenticated \
--args "--privileged"
but I keep getting the following error when I add anything to args whatsoever:
ERROR: (gcloud.run.deploy) argument --args: expected one argument
I am obviously using the args parameter incorrectly but for the life of me I can't figure out why. The example in the docs uses it exactly as I have done.
What am I missing?
EDIT:
Even the example from the docs doesn't work, and returns the same error:
gcloud run deploy \
--args "--repo-allowlist=github.com/example/example_demo" \
--args "--gh-webhook-secret=XX" \
So, I finally got it working. The = wasn't shown in the docs, but it's needed because with a space the parser treats a value beginning with -- as the next flag, leaving --args without a value. Here's the solution:
--args="--privileged"

"unrecognized arguments" error while executing a Dataflow job with gcloud cli

I have created a job in Dataflow UI and it works fine. Now I want to automate it from the command line with a small bash script:
#GLOBAL VARIABLES
export PROJECT="cf-businessintelligence"
export GCS_LOCATION="gs://dataflow-templates/latest/Jdbc_to_BigQuery"
export MAX_WORKERS="15"
export NETWORK="businessintelligence"
export REGION_ID="us-central1"
export STAGING_LOCATION="gs://dataflow_temporary_directory/temp_dir"
export SUBNETWORK="bidw-dataflow-usc1"
export WORKER_MACHINE_TYPE="n1-standard-96"
export ZONE="us-central1-a"
export JOBNAME="test"
#COMMAND
gcloud dataflow jobs run $JOBNAME --project=$PROJECT --gcs-location=$GCS_LOCATION \
--max-workers=$MAX_WORKERS \
--network=$NETWORK \
--parameters ^:^query="select current_date":connectionURL="jdbc:mysql://mysqldbhost:3306/bidw":user="xyz",password="abc":driverClassName="com.mysql.jdbc.Driver":driverJars="gs://jdbc_drivers/mysql-connector-java-8.0.16.jar":outputTable="cf-businessintelligence:bidw.mytest":tempLocation="gs://dataflow_temporary_directory/tmp" \
--region=$REGION_ID \
--staging-location=$STAGING_LOCATION \
--subnetwork=$SUBNETWORK \
--worker-machine-type=$WORKER_MACHINE_TYPE \
--zone=$ZONE
When I run it, it fails with the following error:
ERROR: (gcloud.dataflow.jobs.run) unrecognized arguments:
--network=businessintelligence
Following the instructions in gcloud topic escaping, I believe I correctly escaped my parameters, so I am really confused. Why is it failing on the --network parameter?
Try getting help for your command to see which options it currently accepts:
gcloud dataflow jobs run --help
For me, this displays a number of options, but not the --network option.
I then checked the beta channel:
gcloud beta dataflow jobs run --help
And it does display the --network option. So you'll want to launch your job with gcloud beta dataflow....
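In other words, only the command group changes; a trimmed sketch reusing the variables from the script above:
gcloud beta dataflow jobs run $JOBNAME \
--project=$PROJECT \
--gcs-location=$GCS_LOCATION \
--network=$NETWORK \
--subnetwork=$SUBNETWORK
# ...plus the remaining flags exactly as before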
Both the network and subnetwork arguments need to be the complete URL.
Source: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications
Example for the subnetwork flag:
https://www.googleapis.com/compute/v1/projects/HOST_PROJECT_ID/regions/REGION_NAME/subnetworks/SUBNETWORK_NAME
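Applied to the variables in the question, and assuming the network lives in the same project (no Shared VPC), that would look like:
export NETWORK="https://www.googleapis.com/compute/v1/projects/cf-businessintelligence/global/networks/businessintelligence"
export SUBNETWORK="https://www.googleapis.com/compute/v1/projects/cf-businessintelligence/regions/us-central1/subnetworks/bidw-dataflow-usc1"
(Networks are global resources and subnetworks are regional, hence the differing path segments.)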

Training locally with ML Engine & GCloud

I would like to train my model locally using this command:
gcloud ml-engine local train \
--module-name cloud_runner \
--job-dir ./tmp/output
The issue is that it complains that --job-dir: Must be of form gs://bucket/object.
This is a local training run, so I'm wondering why it wants the output to be a GCS bucket rather than a local directory.
As explained by others, gcloud's --job-dir flag expects the location to be in GCS. To get around that, you can pass the directory directly to your module as a user argument after the -- separator:
gcloud ml-engine local train \
--package-path trainer \
--module-name trainer.task \
-- \
--train-files $TRAIN_FILE \
--eval-files $EVAL_FILE \
--job-dir $JOB_DIR \
--train-steps $TRAIN_STEPS
The --package-path argument to the gcloud command should point to a directory that is a valid Python package, i.e., a directory that contains an __init__.py file (often an empty file). Note that it should be a local directory, not one on GCS.
The --module-name argument should be the fully qualified name of a valid Python module within that package. You can organize your directories however you want, but for the sake of consistency, the samples all have a Python package named trainer with the module to be run named task.py.
-- Source
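For reference, a layout that satisfies both flags might look like this sketch:
trainer/
    __init__.py    # can be empty; it is what makes trainer/ a Python package
    task.py        # run as trainer.task via --module-name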
So you need to change this block to use a valid path:
gcloud ml-engine local train \
--module-name cloud_runner \
--job-dir ./tmp/output
Specifically, your error is due to --job-dir ./tmp/output, because gcloud expects a GCS path of the form gs://bucket/object there.
Local training tries to emulate what happens when you run in the cloud, because the point of local training is to detect problems before submitting your job to the service.
Using a local job-dir when using the CMLE service is an error because the output wouldn't persist after the job finishes.
So local training with gcloud also requires that job-dir be a GCS location.
If you want to run locally and not use GCS you can just run your TensorFlow program directly and not use gcloud.
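For example, since this question's module is cloud_runner, running it directly (assuming the script itself parses a --job-dir flag) sidesteps the GCS check entirely:
python -m cloud_runner --job-dir ./tmp/output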

Creates package but no export

My job completes with no error. The logs show "accuracy", "auc", and other statistical measures of my model. ML Engine creates a package subdirectory, and a tar under that, as expected. But there's no export directory, checkpoint, eval, graph, or any other artifact that I'm accustomed to seeing when I train locally. Am I missing something simple in the command I'm using to call the service?
gcloud ml-engine jobs submit training $JOB_NAME \
--job-dir $OUTPUT_PATH \
--runtime-version 1.0 \
--module-name trainer.task \
--package-path trainer/ \
--region $REGION \
-- \
--model_type wide \
--train_data $TRAIN_DATA \
--test_data $TEST_DATA \
--train_steps 1000 \
--verbose-logging true
The logs show this: model directory = /tmp/tmpS7Z2bq
But I was expecting my model to go to the GCS bucket I defined in $OUTPUT_PATH.
I'm following the steps under "Run a single-instance trainer in the cloud" from the getting started docs.
Maybe you could show where and how you declare $OUTPUT_PATH?
Also, the model directory might be a directory within $OUTPUT_PATH where you can find the model for that specific job.
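Two quick checks worth running before digging deeper (gsutil ships with the Cloud SDK):
echo $OUTPUT_PATH    # should print something like gs://your-bucket/path
gsutil ls -r $OUTPUT_PATH    # lists what the job actually wrote there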
