How does AI Platform (ML Engine) allocate resources to jobs?

I'm trying out a few experiments using Google's AI Platform and have a few questions regarding that.
Basically, my project is structured as per the docs, with a trainer task and a separate batch-prediction task. I want to understand how AI Platform allocates resources to the tasks I execute. Comparing it with the current SOTA solutions like Spark, TensorFlow, and PyTorch is where my doubts arise.
These engines/libraries have distributed workers with dedicated coordination systems and separate distributed implementations of all the machine learning algorithms. Since my tasks are written using scikit-learn, how do these computations parallelize across the cluster provisioned by AI Platform, given that sklearn doesn't have any such distributed computing capabilities?
I'm following the docs here. The command I'm using:
gcloud ai-platform jobs submit training $JOB_NAME \
--job-dir $JOB_DIR \
--package-path $TRAINING_PACKAGE_PATH \
--module-name $MAIN_TRAINER_MODULE \
--region $REGION \
--runtime-version=$RUNTIME_VERSION \
--python-version=$PYTHON_VERSION \
--scale-tier $SCALE_TIER
Any help/ clarifications would be appreciated!

Alas, AI Platform Training can't automatically distribute your scikit-learn tasks. It basically just sets up the cluster, deploys your package to each node, and runs it.
You might want to try a distributed backend such as Dask for scaling out the task -- it has a drop-in replacement for Joblib that can run scikit-learn pipelines on a cluster.
I found one tutorial here: https://matthewrocklin.com/blog/work/2017/02/07/dask-sklearn-simple
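For a sense of what that looks like in practice, here is a minimal sketch (not from the post above; the scheduler address and parameter grid are placeholders) that routes scikit-learn's Joblib parallelism to a Dask cluster:
from dask.distributed import Client
import joblib
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
# Connect to the Dask scheduler; replace the address with your cluster's.
client = Client("tcp://scheduler-address:8786")
X, y = load_digits(return_X_y=True)
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.001, 0.01]}, cv=3)
# scikit-learn's internal Joblib parallelism is dispatched to the Dask workers.
with joblib.parallel_backend("dask"):
    search.fit(X, y)
print(search.best_params_)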
Hope that helps!

Related

Google cloud jobs submit training gets stuck

Hello, I had set up Google Cloud Machine Learning to train a neural network, but suddenly I am unable to submit jobs to Google Cloud.
There is no error, but the command just hangs without doing anything. My instance is also running. Here is the command:
gcloud ml-engine jobs submit training job9123 \
--runtime-version 1.0 \
--job-dir gs://dataset1_giorgaros2 \
--package-path trainmodule \
--module-name trainmodule.nncloud \
--region europe-west1 \
--config cloudml-gpu.yaml \
-- --train-file gs://dataset1_giorgaros2/nnn.p
Thank You !
The ML Engine job logs can help you obtain more details about the failed job execution; in most cases the log contains the cause of the failure.
Finding the job logs on ML Engine
If you are submitting the exact same command for every training job, you might be getting an error related to the job name, since the name must be unique for each job on ML Engine, as described in the naming rules for ML Engine jobs (see the sketch below for one way to keep names unique).
ML Engine naming conventions
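As a rough illustration (hypothetical, assuming you submit jobs from a Python script; the gcloud arguments are copied from the question above), appending a timestamp guarantees a unique job name on every submission:
import subprocess
from datetime import datetime
# Append a UTC timestamp so each submission gets a unique job name.
job_name = "job_" + datetime.utcnow().strftime("%Y%m%d_%H%M%S")
subprocess.run([
    "gcloud", "ml-engine", "jobs", "submit", "training", job_name,
    "--runtime-version", "1.0",
    "--job-dir", "gs://dataset1_giorgaros2",
    "--package-path", "trainmodule",
    "--module-name", "trainmodule.nncloud",
    "--region", "europe-west1",
    "--config", "cloudml-gpu.yaml",
    "--",
    "--train-file", "gs://dataset1_giorgaros2/nnn.p",
], check=True)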
Try checking network connectivity to Google Compute Engine.
Check logs from the run - https://console.cloud.google.com/
And of course, read the docs:
https://cloud.google.com/sdk/gcloud/reference/ml-engine/jobs/submit/training

TensorFlow Docker Images

The general TensorFlow Docker images are not optimized for the exact target architecture.
a) Are there studies of the performance penalty for using these general Docker images vs. compiling for the specific architecture?
b) When using an orchestration system such as KubeFlow/Mesos across a heterogeneous cluster, what are best practices for mapping nodes to the optimized TensorFlow compilation (e.g., installing it on each node, having multiple Docker images...)?
Thanks for your feedback!
For the performance question, you can have a look at Brendan Gregg's container performance analysis.
The overhead is quite small, because Dockerizing is closer to doing a chroot than to virtualization: containers share the host kernel.
The best practice for a heterogeneous cluster is to have a set of images.
For each image, you can run containers with different configurations by passing environment variables (see the sketch below).
If the configuration is going to be the same, you can use the autoscaling function of Kubernetes, for example.
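As a rough illustration (the environment variable names are hypothetical, and this assumes TensorFlow 2.x), a container entrypoint could read its per-node configuration from environment variables like this:
import os
import tensorflow as tf
# Hypothetical per-node settings injected by the orchestrator (Kubernetes/Mesos).
intra_threads = int(os.environ.get("TF_INTRA_OP_THREADS", "0"))  # 0 lets TF decide
inter_threads = int(os.environ.get("TF_INTER_OP_THREADS", "0"))
use_xla = os.environ.get("TF_ENABLE_XLA", "0") == "1"
tf.config.threading.set_intra_op_parallelism_threads(intra_threads)
tf.config.threading.set_inter_op_parallelism_threads(inter_threads)
if use_xla:
    tf.config.optimizer.set_jit(True)
print("intra:", intra_threads, "inter:", inter_threads, "xla:", use_xla)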

Google cloud platform setup ERROR: (gcloud.beta.ml) Invalid choice: 'init-project'

I am using Cloud Shell in Google Cloud Platform. I am trying to get things installed for machine learning. The commands that I have used so far are:
curl https://storage.googleapis.com/cloud-ml/scripts/setup_cloud_shell.sh | bash
export PATH=${HOME}/.local/bin:${PATH}
curl https://storage.googleapis.com/cloud-ml/scripts/check_environment.py | python
gcloud beta ml init-project
The first three lines work fine, but for the last command I get
ERROR: (gcloud.beta.ml) Invalid choice: 'init-project'.
Usage: gcloud beta ml [optional flags] <group>
group may be language | speech | video | vision
For detailed information on this command and its flags, run:
gcloud beta ml --help
This error appears for the last gcloud line. Does anyone know what I can do to solve this problem?
Thank you.
First off, let me note that you don't need to run the beta command, as the gcloud ml variant is also available.
As the error message indicates, 'init-project' is not a valid choice; you should instead use one of the following groups: language, speech, video, vision, each of which allows you to make calls to the corresponding API. For instance, you could run the following:
$gcloud ml vision detect-faces IMAGE_PATH
and detect faces within the indicated image.
That said, from your comments it appears that you are not interested in any of the above. If you are looking to train your own TensorFlow models on Google Cloud Platform, you should take a look at the docs relating to Cloud ML Engine. The page that dsesto pointed you to is a good start. I would advise that you also try out the examples in this GitHub repository, particularly the census one. Once there, you'll also see that the gcloud command group used for training models on the cloud (as well as deploying them and using them for prediction jobs) is actually gcloud ml-engine, not gcloud ml.

Google cloud ML slow cpu operations

I am training a TensorFlow model with, unfortunately, many CPU operations. On my local machine, I compiled TensorFlow from source with support for SSE4.2/AVX/FMA to make training run faster. When I train on gcloud via their ML Engine service, I get a 10x slowdown compared to local. I suspect that TensorFlow on gcloud ML Engine wasn't compiled with CPU optimizations. I was wondering what the ways around this are.

Tensorflow with GPU on Google Cloud

I have a model on Google Cloud Machine Learning using TensorFlow, and it works fine.
Now I want to do some predictions using the GPU.
I saw this link, but it talks about training with a GPU, not prediction. There's nothing about GPUs in the prediction section.
Does anyone know if it's possible to do prediction using Google Cloud Machine Learning Engine with a GPU? Or, if I train with a GPU, does my prediction automatically run on a GPU?
I'm using the following command line:
gcloud ml-engine predict --model ${MODEL_NAME} --json-instances request.json
This command works, but it's using the CPU.
Additional information: my model is deployed in the us-east1 region, and my scaling is automatic.
You cannot choose to use the GPU for prediction in ml-engine. It's unclear whether they are using GPUs by default -- I would link to documentation but there is none available.
I am sure, however, that they are not using TPUs. Currently, Google is only using TPUs for internal services; although they have created a TPU cloud exclusively for researchers to experiment with: https://yourstory.com/2017/05/google-cloud-tpus-machine-learning-models-training/
If you want more control over how your prediction is run, for the same price you can configure a Google Compute Engine with a high-powered Tesla K80 GPU. Your Tensorflow model will work there, too, and it is straightforward to set up.
My suggestion would be to make benchmark predictions using your GCE instance and then compare them to ml-engine. If ml-engine is faster than GCE, then Google is probably using GPUs for prediction. Presumably their goal is to provide GPUs and TPUs as part of ml-engine in the future, but demand is overloading the HPC cloud these days.
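If it helps, here is a rough timing sketch for the ml-engine side of that comparison (the project name, model name, and instance payload are placeholders; it assumes the google-api-python-client library and application default credentials):
import time
from googleapiclient import discovery
PROJECT = "my-project"  # placeholder
MODEL = "my-model"      # placeholder
service = discovery.build("ml", "v1")
name = "projects/{}/models/{}".format(PROJECT, MODEL)
instances = [{"input": [0.0] * 10}]  # shape depends on your model's signature
start = time.perf_counter()
for _ in range(20):
    response = service.projects().predict(name=name, body={"instances": instances}).execute()
elapsed = time.perf_counter() - start
print("average latency per request: {:.3f}s".format(elapsed / 20))
print(response.get("predictions"))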
Online predictions on GCP ML Engine use single-core CPUs by default, which have high latency. If it suits your requirements, you can use the quad-core CPU, which serves predictions faster. To use it, you must specify the type of CPU for predictions when creating a version of your model on ML Engine. Link to the documentation: https://cloud.google.com/ml-engine/docs/tensorflow/online-predict.
We support GPU now. Documentation here!
Example:
gcloud beta ai-platform versions create version_name \
--model model_name \
--origin gs://model-directory-uri \
--runtime-version 2.1 \
--python-version 3.7 \
--framework tensorflow \
--machine-type n1-standard-4 \
--accelerator count=1,type=nvidia-tesla-t4 \
--config config.yaml
If you use one of the Compute Engine (N1) machine types for your model version, you can optionally add GPUs to accelerate each prediction node.
NVIDIA Tesla K80
NVIDIA Tesla P4
NVIDIA Tesla P100
NVIDIA Tesla T4
NVIDIA Tesla V100
This site has some information:
https://cloud.google.com/ml-engine/docs/how-tos/getting-started-training-prediction
However, it is a completely different way to train and predict. They provide the means for training and predicting on their service infrastructure. You just build the model with your TensorFlow program and then use their hardware through their Cloud SDK. So it shouldn't matter to you whether it runs on a CPU or GPU.
