Training locally with ML Engine & GCloud - google-cloud-ml-engine

I would like to train my model locally using this command:
gcloud ml-engine local train \
--module-name cloud_runner \
--job-dir ./tmp/output
The issue is that it complains that --job-dir: Must be of form gs://bucket/object.
This is a local train so I'm wondering why it wants the output to be a gs storage bucket rather than a local directory.

As explained by others, gcloud's --job-dir flag expects the location to be in GCS. To work around that, you can pass the directory directly to your module as a user argument (after the -- separator):
gcloud ml-engine local train \
--package-path trainer \
--module-name trainer.task \
-- \
--train-files $TRAIN_FILE \
--eval-files $EVAL_FILE \
--job-dir $JOB_DIR \
--train-steps $TRAIN_STEPS

The --package-path argument to the gcloud command should point to a directory that is a valid Python package, i.e., a directory that contains an __init__.py file (often an empty file). Note that it should be a local directory, not one on GCS.
The --module-name argument should be the fully qualified name of a valid Python module within that package. You can organize your directories however you want, but for the sake of consistency, the samples all have a Python package named trainer with the module to be run named task.py.
-- Source
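For reference, a minimal layout matching what the samples describe would look like this (a sketch; the file contents are omitted):
mkdir trainer
touch trainer/__init__.py   # marks trainer/ as a Python package
touch trainer/task.py       # the module referred to by --module-name trainer.task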
So you need to update this block to use a valid path:
gcloud ml-engine local train \
--module-name cloud_runner \
--job-dir ./tmp/output
Specifically, your error is caused by --job-dir ./tmp/output, because gcloud's own --job-dir flag expects a path of the form gs://bucket/object.

Local training tries to emulate what happens when you run using the Cloud because the point of local training is to detect problems before submitting your job to the service.
Using a local job-dir when using the CMLE service is an error because the output wouldn't persist after the job finishes.
So local training with gcloud also requires that job-dir be a GCS location.
If you want to run locally and not use GCS you can just run your TensorFlow program directly and not use gcloud.
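For example, with the trainer package layout from the samples, running the trainer directly (without gcloud) might look like this (a sketch; the path is a placeholder):
python -m trainer.task --job-dir ./tmp/output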

Related

Avoiding duplicated arguments when running a Docker container

I have a TensorFlow training script which I want to run using a Docker container (based on the official TF GPU image). Although everything works just fine, running the container with the script is horribly verbose and ugly. The main problem is that my training script allows the user to specify various directories used during training, for input data, logging, generating output, etc. I don't want to have to change what my users are used to, so the container needs to be informed of the location of these user-defined directories, so it can mount them. So I end up with something like this:
docker run \
-it --rm --gpus all -d \
--mount type=bind,source=/home/guest/datasets/my-dataset,target=/datasets/my-dataset \
--mount type=bind,source=/home/guest/my-scripts/config.json,target=/config.json \
-v /home/guest/my-scripts/logdir:/logdir \
-v /home/guest/my-scripts/generated:/generated \
train-image \
python train.py \
--data_dir /datasets/my-dataset \
--gpu 0 \
--logdir ./logdir \
--output ./generated \
--config_file ./config.json \
--num_epochs 250 \
--batch_size 128 \
--checkpoint_every 5 \
--generate True \
--resume False
In the above I am mounting a dataset from the host into the container, and also mounting a single config file config.json (which configures the TF model). I specify a logging directory logdir and an output directory generated as volumes. Each of these resources is also passed as a parameter to the train.py script.
This is all very ugly, but I can't see another way of doing it. Of course I could put all this in a shell script and provide command-line arguments which set these duplicated values from the outside. But this doesn't seem like a nice solution, because if I want to do anything else with the container, for example check the logs, I would have to use the raw docker command.
I suspect this question will likely be tagged as opinion-based, but I've not found a good solution for this that I can recommend to my users.
As user Ron van der Heijden points out, one solution is to use docker-compose in combination with environment variables defined in an .env file. Nice answer.
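A minimal sketch of that idea, with hypothetical paths: the directories are declared once in an .env file, and a docker-compose.yml (not shown here) references them as ${DATA_DIR} and so on in both its volume mounts and the training command, so each path is written in exactly one place.
# .env -- docker-compose reads this file from the project directory
DATA_DIR=/home/guest/datasets/my-dataset
LOG_DIR=/home/guest/my-scripts/logdir
OUT_DIR=/home/guest/my-scripts/generated
CONFIG_FILE=/home/guest/my-scripts/config.json
Users then only need to run docker-compose up, and changing a directory means editing a single line.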

Error: "argument --job-dir: expected one argument" while training model using AI Platform on GCP

Running macOS Mojave.
I am following the official getting started documentation to run a model using AI platform.
So far I managed to train my model locally using:
# This is similar to `python -m trainer.task --job-dir local-training-output`
# but it better replicates the AI Platform environment, especially
# for distributed training (not applicable here).
gcloud ai-platform local train \
--package-path trainer \
--module-name trainer.task \
--job-dir local-training-output
I then proceed to train the model using AI platform by going through the following steps:
Setting environment variables export JOB_NAME="my_first_keras_job" and export JOB_DIR="gs://$BUCKET_NAME/keras-job-dir".
Run the following command to package the trainer/ directory:
Command as indicated in docs:
gcloud ai-platform jobs submit training $JOB_NAME \
--package-path trainer/ \
--module-name trainer.task \
--region $REGION \
--python-version 3.5 \
--runtime-version 1.13 \
--job-dir $JOB_DIR \
--stream-logs
I get the error:
ERROR: (gcloud.ai-platform.jobs.submit.training) argument --job-dir:
expected one argument Usage: gcloud ai-platform jobs submit training
JOB [optional flags] [-- USER_ARGS ...] optional flags may be
--async | --config | --help | --job-dir | --labels | ...
As far as I understand, --job-dir does indeed have one argument.
I am not sure what I'm doing wrong. I am running the above command from the trainer/ directory as is shown in the documentation. I tried removing all spaces as described here but the error persists.
Are you running this command locally, or on an AI notebook VM in Jupyter? Based on your details I assume you're running it locally. I am not a Mac expert, but hopefully this is helpful.
I just worked through the same error on an AI notebook VM and my issue was that even though I assigned it a value in a previous Jupyter cell, the $JOB_NAME variable was passing along an empty string in the gcloud command. Try running the following to make sure your code is actually passing a value for $JOB_DIR when you are making the gcloud ai-platform call.
echo $JOB_DIR
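More generally, an empty variable makes the shell drop the value entirely, so gcloud sees the next flag where the argument should be and reports "expected one argument". A small guard before submitting can catch this early (a sketch, not part of the original answer):
# abort early if either variable is unset or empty
if [ -z "$JOB_NAME" ] || [ -z "$JOB_DIR" ]; then
  echo "JOB_NAME or JOB_DIR is empty" >&2
  exit 1
fi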

Cloud Composer missing variables file

I've been trying to import a JSON file of environment variables to a newly created Cloud Composer instance using the airflow CLI but when running the below I get the error: Missing variables file.
gcloud composer environments run ${COMPOSER_NAME} \
--location=${COMPOSER_LOCATION} \
variables -- \
-i ${VARIABLES_JSON}
From looking at the source it seems that this happens when an environment variable file doesn't exist at the expected location. Is this because Cloud Composer sets up its variables in a different location so this CLI won't work? I've noticed that there's a env_var.json file that's created on the instance's GCS bucket, I realise I can overwrite this file but that doesn't seem like best practice.
It feels like a hack but I copied over the variables.json to my Composer's GCS bucket data folder and then it worked.
This is because os.path.exists() checks the filesystem of the container that Airflow is running on, not your local machine. I chose this approach over overwriting env_var.json because I get the variables in Airflow's UI with this method.
Script for anyone interested:
COMPOSER_DATA_FOLDER=/home/airflow/gcs/data
COMPOSER_GCS_BUCKET=$(gcloud composer environments describe ${COMPOSER_NAME} --location ${COMPOSER_LOCATION} | grep 'dagGcsPrefix' | grep -Eo "\S+/")
gsutil cp ${ENV_VARIABLES_JSON_FILE} ${COMPOSER_GCS_BUCKET}data
gcloud composer environments run ${COMPOSER_NAME} \
--location ${COMPOSER_LOCATION} variables -- \
-i ${COMPOSER_DATA_FOLDER}/variables.json
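To confirm the import worked, you can read one of the imported keys back through the same CLI wrapper (a sketch; my_variable is a hypothetical key from the JSON file):
gcloud composer environments run ${COMPOSER_NAME} \
--location ${COMPOSER_LOCATION} variables -- \
-g my_variable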

How to verify if the content of two Docker images is exactly the same?

How can we determine that two Docker images have exactly the same file system structure, and that the content of corresponding files is the same, irrespective of file timestamps?
I tried the image IDs but they differ when building from the same Dockerfile and a clean local repository. I did this test by building one image, cleaning the local repository, then touching one of the files to change its modification date, then building the second image, and their image IDs do not match. I used Docker 17.06 (the latest version I believe).
If you want to compare the content of images you can use the docker inspect <imageName> command and look at the RootFS section:
docker inspect redis
"RootFS": {
"Type": "layers",
"Layers": [
"sha256:eda7136a91b7b4ba57aee64509b42bda59e630afcb2b63482d1b3341bf6e2bbb",
"sha256:c4c228cb4e20c84a0e268dda4ba36eea3c3b1e34c239126b6ee63de430720635",
"sha256:e7ec07c2297f9507eeaccc02b0148dae0a3a473adec4ab8ec1cbaacde62928d9",
"sha256:38e87cc81b6bed0c57f650d88ed8939aa71140b289a183ae158f1fa8e0de3ca8",
"sha256:d0f537e75fa6bdad0df5f844c7854dc8f6631ff292eb53dc41e897bc453c3f11",
"sha256:28caa9731d5da4265bad76fc67e6be12dfb2f5598c95a0c0d284a9a2443932bc"
]
}
If all layers are identical, then the images contain identical content.
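To compare just the layer digests of two images without reading the full JSON, docker inspect can print them directly with a Go template (a sketch; the image names are placeholders):
docker inspect --format '{{json .RootFS.Layers}}' image-a
docker inspect --format '{{json .RootFS.Layers}}' image-b
# identical output means identical content; differing digests do not
# necessarily mean the files differ, since timestamps and other metadata
# also feed into each layer's digest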
After some research I came up with a solution which is fast and clean per my tests.
The overall solution is this:
Create a container for your image via docker create ...
Export its entire file system to a tar archive via docker export ...
Pipe the archive's directory names, symlink names, symlink targets, file names, and file contents to a hash function (e.g., MD5)
Compare the hashes of different images to verify if their contents are equal or not
And that's it.
Technically, this can be done as follows:
1) Create file md5docker, and give it execution rights, e.g., chmod +x md5docker:
#!/bin/sh
dir=$(dirname "$0")
docker create $1 | { read cid; docker export $cid | $dir/tarcat | md5; docker rm $cid > /dev/null; }
2) Create file tarcat, and give it execution rights, e.g., chmod +x tarcat:
#!/usr/bin/env python3
# coding=utf-8
if __name__ == '__main__':
    import sys
    import tarfile
    with tarfile.open(fileobj=sys.stdin.buffer, mode="r|*") as tar:
        for tarinfo in tar:
            if tarinfo.isfile():
                print(tarinfo.name, flush=True)
                with tar.extractfile(tarinfo) as file:
                    sys.stdout.buffer.write(file.read())
            elif tarinfo.isdir():
                print(tarinfo.name, flush=True)
            elif tarinfo.issym() or tarinfo.islnk():
                print(tarinfo.name, flush=True)
                print(tarinfo.linkname, flush=True)
            else:
                print("\33[0;31mIGNORING:\33[0m ", tarinfo.name, file=sys.stderr)
3) Now invoke ./md5docker <image>, where <image> is your image name or id, to compute an MD5 hash of the entire file system of your image.
To verify if two images have the same contents just check that their hashes are equal as computed in step 3).
Note that this solution only considers directory structure, regular file contents, and symlinks (soft and hard) as content. If you need more, just change the tarcat script by adding more elif clauses testing for the content you wish to include (see Python's tarfile module, and look for the TarInfo.isXXX() methods corresponding to the needed content).
The only limitation I see in this solution is its dependency on Python (I am using Python 3, but it should be very easy to adapt to Python 2). A better solution without any dependency, and probably faster (hey, this is already very fast), is to write the tarcat script in a language supporting static linking so that a standalone executable file would be enough (i.e., one not requiring any external dependencies beyond the OS). I leave this as a future exercise in C, Rust, OCaml, Haskell, you choose.
Note, if MD5 does not suit your needs, just replace md5 inside the first script with your hash utility.
Hope this helps anyone reading.
Amazes me that docker doesn't do this sort of thing out of the box. Here's a variant on @mljrg's technique:
#!/bin/sh
docker create $1 | {
read cid
docker export $cid | tar Oxv 2>&1 | shasum -a 256
docker rm $cid > /dev/null
}
It's shorter, doesn't need a python dependency or a second script at all, I'm sure there are downsides but it seems to work for me with the few tests I've done.
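Usage is the same as with the md5docker script; for example (the script name hash-image.sh is hypothetical):
./hash-image.sh image-a > a.sha256
./hash-image.sh image-b > b.sha256
diff a.sha256 b.sha256 && echo "images have the same content"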
There doesn't seem to be a standard way of doing this. The best way that I can think of is using the Docker multi-stage build feature.
For example, here I am comparing the alpine and debian images. In your case, set the image names to the ones you want to compare.
I basically copy all the files from each image into a git repository and commit after each copy.
FROM alpine as image1
FROM debian as image2
FROM ubuntu
RUN apt-get update && apt-get install -y git
RUN git config --global user.email "you@example.com" &&\
git config --global user.name "Your Name"
RUN mkdir images
WORKDIR images
RUN git init
COPY --from=image1 / .
RUN git add . && git commit -m "image1"
COPY --from=image2 / .
RUN git add . && git commit -m "image2"
CMD tail > /dev/null
This will give you an image with a git repository that records the differences between the two images.
docker build -t compare .
docker run -it compare bash
Now if you do a git log you can see the logs and you can compare the two commits using git diff <commit1> <commit2>
Note: If the image building fails at the second commit, this means that the images are identical, since a git commit will fail if there are no changes to commit.
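If you would rather have the build succeed even when the images turn out to be identical, one possible tweak to the Dockerfile above (not part of the original answer) is to allow an empty commit for the second image:
RUN git add . && git commit --allow-empty -m "image2"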
If we rebuild the Dockerfile it is almost certainly going to produce a new hash.
The only way to create an image with the same hash is to use docker save and docker load. See https://docs.docker.com/engine/reference/commandline/save/
We could then use Bukharov Sergey's answer (i.e. docker inspect) to inspect the layers, looking at the section with key 'RootFS'.
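A minimal sketch of that save/load round trip (the image and file names are placeholders):
docker save -o myimage.tar myimage:latest
docker load -i myimage.tar   # restores the image with the same ID and layer digests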

Creates package but no export

My job completes with no error. The logs show "accuracy", "auc", and other statistical measures of my model. ML Engine creates a package subdirectory, and a tar under that, as expected. But there's no export directory, checkpoint, eval, graph or any other artifact that I'm accustomed to seeing when I train locally. Am I missing something simple with the command I'm using to call the service?
gcloud ml-engine jobs submit training $JOB_NAME \
--job-dir $OUTPUT_PATH \
--runtime-version 1.0 \
--module-name trainer.task \
--package-path trainer/ \
--region $REGION \
-- \
--model_type wide \
--train_data $TRAIN_DATA \
--test_data $TEST_DATA \
--train_steps 1000 \
--verbose-logging true
The logs show this: model directory = /tmp/tmpS7Z2bq
But I was expecting my model to go to the GCS bucket I defined in $OUTPUT_PATH.
I'm following the steps under "Run a single-instance trainer in the cloud" from the getting started docs.
Maybe you could show where and how you declare the $OUTPUT_PATH?
Also, the model directory might be the directory within $OUTPUT_PATH where you can find the model of that specific job.
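For reference, in the getting-started samples $OUTPUT_PATH is usually assembled from a bucket name and job name along these lines (a sketch; the names are placeholders):
BUCKET_NAME=your-bucket-name
JOB_NAME=my_training_job
OUTPUT_PATH=gs://$BUCKET_NAME/$JOB_NAME
If the logs still show a /tmp model directory, it is worth checking that the trainer actually uses the --job-dir value it receives rather than a default temporary directory.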
