Trainer module not found in Google Cloud ML Engine

I am trying to tune my variational autoencoder's hyperparameters using Google Cloud ML Engine. I set up my package with the structure they recommend in the docs, so that I specify "trainer.task" as my main module name. Below is an image of my directory structure.
[image of directory structure]
This works on my own machine when I include the following lines:
import sys
sys.path.append("/path/to/project/directory/")
When I run the command below, I get the error "No module named trainer". Is there a different path I need to specify, or something special I need to do for running on Google Cloud ML Engine?
gcloud ml-engine jobs submit training $JOB_NAME --package-path $TRAINER_PACKAGE_PATH --module-name $MAIN_TRAINER_MODULE --job-dir $JOB_DIR --region $REGION --config config.yaml

Do you have a setup.py file? If so, you might be hitting this issue.
To debug this:
Get the GCS location of the package from the job
gcloud --project=$PROJECT ml-engine jobs describe $JOB_NAME
This will output something like
jobId: somejob
state: PREPARING
trainingInput:
  jobDir: gs://BUCKET/job
  packageUris:
  - gs://bucket/job/packages/7d2611c7366f266058da5a9e2c93467426c5fdd018491fa33853516d9db533b1/somepackage-0.0.0.tar.gz
  pythonModule: cifar.task
  region: us-central1
trainingOutput: {}
Note the values above are for illustrative purposes only and will differ from your output.
Copy the GCS package to your machine
gsutil cp gs://bucket/job/packages/7d2611c7366f266058da5a9e2c93467426c5fdd018491fa33853516d9db533b1/somepackage-0.0.0.tar.gz /tmp
Unpack the .tar.gz and check that it has a directory trainer containing an __init__.py file and task.py. If not, then you probably specified incorrect values on the command line.
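For reference, the unpacked archive should contain roughly this layout (a sketch based on the structure the docs recommend; the package name is illustrative):
somepackage-0.0.0/
    setup.py
    trainer/
        __init__.py
        task.py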
If you include the actual command line (i.e. the values for the variables) and the contents of .tar.gz, I can probably provide a better answer.
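As a general reference, a minimal setup.py sitting next to the trainer/ directory usually looks something like this (a sketch only; the name and requirements are illustrative):
# Minimal packaging config; find_packages() only discovers trainer/ if it
# contains an __init__.py file.
from setuptools import find_packages, setup

setup(
    name='trainer',
    version='0.1',
    packages=find_packages(),
    install_requires=[],
)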

Jeremy, I had a similar problem. I downloaded and unzipped my files, but there was no task.py in them.
These are the cmd line arguments I used:
gcloud ml-engine jobs submit training job11 \
    --package-path=./trainer \
    --module-name='Keras_On_GoogleCloud.trainer.shallownet_train' \
    --job-dir=gs://zubair-gc-bucket/jobs/job11 \
    --region='us-central1' \
    --config=trainer/cloudml-gpu.yaml \
    -- --job_name='zubair-gc-job11' --dataset='dataset/animals' --model='shallownet_weights1.hdf5'

Related

Conda: how to add packages to environment from log (not yaml)?

I'm doing an internship (= yes, I'm a newbie). My supervisor told me to create a conda environment. She passed me a log file containing many packages.
A quick qwant.com search shows me how to create envs via
conda env create --file env_file.yaml
The file I was given is, however, NOT a yaml file; it is structured like so:
# packages in environment at /home/supervisors_name/.conda/envs/pancancer:
#
# Name                    Version                   Build    Channel
_libgcc_mutex             0.1                       main
bedtools                  2.29.2                    hc088bd4_0    bioconda
blas                      1.0                       mkl
bzip2                     1.0.8                     h7b6447c_0
The file contains 41 packages (44 lines including the comments above). For simplicity I'm showing only the first 7.
Apart from adding an env name (see 2. below), is there a way to use the file as it is to generate an environment with the packages?
I ran the command
conda env create --file supervisors.log.txt
SpecNotFound: Environment with requirements.txt file needs a name
Where in the file should I put the name?
Alright, so it seems that they gave you the output of conda list rather than the .yml file produced by conda with conda env export > myenv.yml. Therefore you have two solutions:
You ask for the proper file and then proceed to install the env with conda's built-in pipeline.
If you do not have access to the proper file, you could do one of the following:
i) Parse it with Python into a proper .yml file and then do the conda procedure (see the sketch after this list).
ii) Write a bash script that installs the packages listed in the file she gave you.
This is how I would proceed, personally :)
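For example, a minimal sketch of option i), assuming the log keeps the whitespace-separated name/version columns shown above (file names are illustrative):
# Convert `conda list` output into a minimal environment.yml.
# Assumes non-comment lines look like: name  version  build  [channel].
packages = []
with open("supervisors.log.txt") as f:
    for raw in f:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name, version = line.split()[:2]
        packages.append(f"  - {name}={version}")

with open("environment.yml", "w") as out:
    out.write("name: pancancer\n")
    out.write("channels:\n  - defaults\n  - bioconda\n")
    out.write("dependencies:\n")
    out.write("\n".join(packages) + "\n")
After that, conda env create --file environment.yml should work as usual.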
Because there is no other SO post on this error, for people of the future: I got this error just because I named my file conda_environment.txt instead of conda_environment.yml. Looks like the yml extension is mandatory.

How to serve custom MLflow model with Docker?

We have a project following essentially this docker example, with the only difference that we created a custom model similar to this, whose code lies in a directory called forecast. We succeeded in running the model with mlflow run. The problem arises when we try to serve the model. After doing
mlflow models build-docker -m "runs:/my-run-id/my-model" -n "my-image-name"
running the container with
docker run -p 5001:8080 "my-image-name"
fails with the following error:
ModuleNotFoundError: No module named 'forecast'
It seems that the docker image is not aware of the source code defining our custom model class.
With a Conda environment the problem does not arise, thanks to the code_path argument in mlflow.pyfunc.log_model.
Our Dockerfile is very basic, with just FROM continuumio/miniconda3:4.7.12 and RUN pip install {model_dependencies}.
How can we let the Docker image know about the source code needed to deserialise the model and run it?
You can specify source code dependencies by setting the code_paths argument when logging the model. So in your case, you can do something like:
mlflow.pyfunc.log_model(..., code_paths=[<path to your forecast.py file>])
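For a fuller picture, here is a sketch of how the logging call could look (the ForecastModel class and the paths are illustrative, and depending on your MLflow version the argument is spelled code_path or code_paths):
import mlflow
import mlflow.pyfunc

from forecast.model import ForecastModel  # hypothetical custom PythonModel subclass

with mlflow.start_run() as run:
    mlflow.pyfunc.log_model(
        artifact_path="my-model",
        python_model=ForecastModel(),
        code_paths=["forecast"],   # ship the whole forecast/ package with the model
        conda_env="conda.yaml",    # or a dict listing the model's dependencies
    )
    # run.info.run_id is the id to use in:
    #   mlflow models build-docker -m "runs:/<run_id>/my-model" -n "my-image-name"
    print(run.info.run_id)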

Error when uploading from a Cloud Storage bucket to Google Colab using 'gsutil'

I've trained a model in a Compute Engine VM instance in GCP and copied the weights into a Cloud Storage bucket using the gsutil cp -r command.
I then made the bucket public and tried to copy those weights into a Google Colab notebook using the command !gsutil cp -r gs://{bucket/folder} ./
However, I get the following error:
ResumableDownloadException: Transfer failed after 23 retries. Final exception: Anonymous caller does not have storage.objects.get access to {folder/path}
Why am I getting this error?
Edit:
The Cloud Storage bucket is missing the appropriate Cloud IAM role to make it fully publicly readable. The role roles/storage.objectViewer provides the necessary permissions to read and list objects from the bucket - assigning it to allUsers will make it public.
Therefore, as per the documentation, this can be achieved with a single gsutil iam command:
gsutil iam ch allUsers:objectViewer gs://[BUCKET_NAME]
And then, in Google Colab you should be able to read (or download) objects from Cloud Storage buckets with:
!gsutil cp -r gs://[BUCKET_NAME]/[FOLDER_NAME] ./
A safer approach, instead of making the entire Cloud Storage bucket public, is to authenticate with it using the following Python code in the notebook:
from google.colab import auth
auth.authenticate_user()
Then set the project ID you're using with a gcloud command, replacing my-project accordingly:
!gcloud config set project my-project
And finally run the gsutil command, replacing the bucket and folder:
!gsutil cp -r gs://[BUCKET_NAME]/[FOLDER_NAME] ./

How to both run an AWS SAM template locally and deploy it to AWS successfully (with Java Lambdas)?

I'm trying to build an AWS application using SAM (Serverless Application Model) with the Lambdas written in Java.
I was able to get it running locally by using a resource definition like this in the template:
Resources:
  HelloWorldFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: HelloWorldFunction
      Handler: helloworld.App::handleRequest
      Runtime: java8
      Events:
        HelloWorld:
          Type: Api
          Properties:
            Path: /hello
            Method: get
But to get the sam package phase to upload only the actual code (and not the whole project directory) to S3 I had to change it to this:
...
    Properties:
      CodeUri: HelloWorldFunction/target/HelloWorld-1.0.jar
...
as documented in the AWS SAM example project README.
However, this breaks the ability to run the application locally with sam build followed by sam local start-api.
I tried to get around this by giving the CodeUri value as a parameter (with --parameter-overrides) and this works locally but breaks the packaging phase because of a known issue with the SAM translator.
Is there a way to make both the local build and the real AWS deployment working, preferably with the same template file?
The only workaround I've come up with myself so far is to use different template files for local development and actual packaging and deployment.
To avoid maintaining two almost equal template files I wrote a script for running the service locally:
#!/bin/bash
echo "Copying template..."
sed 's/CodeUri: .*/CodeUri: HelloWorldFunction/' template.yaml > template-local.yaml
echo "Building..."
if sam build -t template-local.yaml
then
    echo "Serving local API..."
    sam local start-api
else
    echo "Build failed, not running service."
fi
This feels less than optimal but does the trick. Would love to hear better alternatives, still.
Another idea that came to mind was extending a mutual base template with separate CodeUri values for these cases but I don't think SAM templates support anything like that.

How does one write a ClusterSpec for distributed YouTube-8M challenge training?

Can someone please post a ClusterSpec for distributed training of the models defined in the YouTube-8m Challenge code?
The code tries to load a cluster spec from TF_CONFIG environment variable. However, I'm not sure what the value for TF_CONFIG should be. I have access to 2 GPUs on one machine and just want to run the model with data-level parallelism.
If you want to run the YouTube-8M challenge code in a distributed manner, you have to write a yaml file (there is an example yaml file provided by Google) and then pass the location of this yaml file as a parameter. TF_CONFIG refers to the configuration variables used to train the model.
For example, to run the starter code on Google Cloud in a distributed manner, I have used:
JOB_NAME=yt8m_train_$(date +%Y%m%d_%H%M%S); gcloud --verbosity=debug ml-engine jobs \
submit training $JOB_NAME \
--package-path=youtube-8m --module-name=youtube-8m.train \
--staging-bucket=$BUCKET_NAME --region=us-east1 \
--config=youtube-8m/cloudml-gpu-distributed.yaml \
-- --train_data_pattern='gs://youtube8m-ml-us-east1/1/frame_level/train/train*.tfrecord' \
--frame_features=True --model=LstmModel --feature_names="rgb,audio" \
--feature_sizes="1024, 128" --batch_size=128 \
--train_dir=$BUCKET_NAME/${JOB_TO_EVAL}
The config parameter points to the yaml file cloudml-gpu-distributed.yaml with the following specification:
trainingInput:
  runtimeVersion: "1.0"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 2
  workerType: standard_gpu
  parameterServerCount: 2
  parameterServerType: standard
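For reference, ML Engine then sets a TF_CONFIG environment variable on each replica describing the cluster; a rough sketch of its shape (hostnames, ports, and indices are purely illustrative) is:
import json
import os

# Rough shape of the TF_CONFIG each replica sees; the training code reads it
# to build its ClusterSpec. Values below are illustrative only.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "master": ["localhost:2222"],
        "worker": ["localhost:2223", "localhost:2224"],
        "ps": ["localhost:2225", "localhost:2226"],
    },
    "task": {"type": "worker", "index": 0},   # differs per process
    "environment": "cloud",
})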
