I am training a TensorFlow model with, unfortunately, many CPU operations. On my local machine, I compiled TensorFlow from source with support for SSE4.2/AVX/FMA to make training run faster. When I train on gcloud via their ML Engine service, I get a 10x slowdown compared to local. I suspect that TensorFlow on gcloud ML Engine wasn't compiled with CPU optimizations, and I was wondering what the ways around this are.
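For reference, the kind of optimized local build I mean is roughly the following (the exact --copt flags depend on the local CPU, and the paths are only illustrative):

bazel build -c opt --copt=-msse4.2 --copt=-mavx --copt=-mfma \
//tensorflow/tools/pip_package:build_pip_package
# package the optimized build as a pip wheel and install it
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
pip install /tmp/tensorflow_pkg/tensorflow-*.whl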
Related
As per the documentation, I am building and deploying my code to Cloud Run. I have configured the machine it's running on to have 2 CPU cores.
Since Cloud Run manages scaling automatically, will I get additional performance benefits from using the Node cluster module to utilize both CPU cores on the host machine?
If your code can leverage 2 (or more) CPUs at the same time to process the same request, using more than 1 CPU makes sense.
If, like the majority of developers, you use Node.js as-is, i.e. as a single-threaded runtime, don't set 2 CPUs on your Cloud Run service. Set one, and let Cloud Run automatically scale the number of parallel instances.
At a high level, it's like having a cluster of VMs, or one big multi-CPU VM with 1 thread per CPU. That's the power of horizontal scaling while keeping the code simple.
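For example, a deployment along these lines (the service name, image and limits are placeholders, not a prescribed setup) keeps each instance on a single CPU and lets Cloud Run handle the fan-out:

gcloud run deploy my-service \
--image gcr.io/my-project/my-node-app \
--cpu 1 \
--concurrency 80 \
--max-instances 10

Each instance stays a simple single-threaded Node.js process; the parallelism comes from Cloud Run starting more instances as the request load grows.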
I'm trying out a few experiments using Google's AI Platform and have a few questions regarding that.
Basically, my project is structured as per the docs, with a trainer task and a separate batch prediction task. I want to understand how AI Platform allocates resources to the tasks I execute. Comparing it with the current SOTA solutions like Spark, TensorFlow and PyTorch is where my doubts arise.
These engines/libraries have distributed workers with dedicated coordination systems and separate distributed implementations of all the machine learning algorithms. Since my tasks are written using scikit-learn, how do these computations parallelize across the cluster that is provisioned by AI Platform, given that sklearn doesn't have any such distributed computing capabilities?
I'm following the docs here. The command I'm using:
gcloud ai-platform jobs submit training $JOB_NAME \
--job-dir $JOB_DIR \
--package-path $TRAINING_PACKAGE_PATH \
--module-name $MAIN_TRAINER_MODULE \
--region $REGION \
--runtime-version=$RUNTIME_VERSION \
--python-version=$PYTHON_VERSION \
--scale-tier $SCALE_TIER
Any help/ clarifications would be appreciated!
Alas, AI Platform Training can't automatically distribute your scikit-learn tasks. It basically just sets up the cluster, deploys your package to each node, and runs it.
You might want to try a distributed backend such as Dask for scaling out the task -- it has a drop-in replacement for Joblib that can run scikit-learn pipelines on a cluster.
I found one tutorial here: https://matthewrocklin.com/blog/work/2017/02/07/dask-sklearn-simple
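As a sketch (the hostname is a placeholder), spinning up a small Dask cluster yourself on the provisioned nodes could look like this:

pip install dask distributed joblib
# on the coordinator node
dask-scheduler
# on each worker node, pointing at the scheduler's address
dask-worker tcp://SCHEDULER_HOST:8786

Inside the trainer you would then wrap the scikit-learn fit call in Joblib's Dask backend (joblib.parallel_backend("dask")) so that the parallelizable parts are shipped to those workers; the tutorial above walks through the details.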
Hope that helps!
The general TensorFlow Docker images won't be optimized for the exact target architecture.
a) Are there studies of the performance penalty of using these general Docker images vs. compiling for the specific architecture?
b) When using an orchestration system such as Kubeflow/Mesos across a heterogeneous cluster, what are the best practices for mapping nodes to an optimized TensorFlow build (e.g., installing it on each node, having multiple Docker images, ...)?
Thanks for your feedback!
For performance, you can have a look at Brendan Gregg's container performance analysis.
Container performance is quite good, because dockerizing is closer to doing a chroot than to virtualization: containers share the host kernel.
The best practice for a heterogeneous cluster is to have a set of images.
For each image, you can run containers with different configurations by passing environment variables.
If the configuration is going to be the same, you can use the autoscaling features of Kubernetes, for example.
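As a rough sketch (the image names, the TF_COPT_FLAGS build argument, the Dockerfile behind it and the train.py entrypoint are assumptions, not an official setup), you could build one image per CPU feature set from the same Dockerfile and vary the runtime configuration with environment variables:

# one image per CPU generation, built from the same Dockerfile
docker build -t my-registry/tensorflow:avx2 --build-arg TF_COPT_FLAGS="--copt=-mavx2 --copt=-mfma" .
docker build -t my-registry/tensorflow:sse42 --build-arg TF_COPT_FLAGS="--copt=-msse4.2" .
# same image, different runtime configuration via environment variables
docker run -e OMP_NUM_THREADS=8 my-registry/tensorflow:avx2 python train.py

On Kubernetes you would then pin each image variant to the matching nodes (for example with node labels and nodeSelector) and let the autoscaler take care of identically configured replicas.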
I have used HashiCorp Packer for building baked VM images.
But I was wondering whether LinuxKit does the same thing, i.e. building baked VM images, with the only difference being that it is more container- and kernel-centric.
I want to know the exact difference between how these two work and their use cases.
Also, is there any use case for using both Packer and LinuxKit together?
I have used both fairly extensively (disclosure: I am a volunteer maintainer for LinuxKit). I used packer for quite some time, and switched almost all of the work I did in packer over to LinuxKit (lkt).
In principle both are open-source tools that serve the same purpose: generate an OS image that can be run. Practically, most use them for VM images to run on vbox, AWS, Azure, GCP, etc., but you can generate an image that will run on bare metal, which I have done as well.
Packer, being older, has a more extensive array of provisioners, builders, plugins, etc. It tries to be fairly broad-based and non-opinionated. Build for everywhere, run any install you want.
LinuxKit runs almost everything - onboot processes and continuous services - in a container. Even the init phase - where the OS image will be booted - is configured by copying files from OCI images.
LinuxKit's strong opinions about how to run and build things can in some ways be restrictive, but also liberating.
The most important differences, in my opinion, are the following:
1. lkt builds up from scratch to the bare minimum you need; Packer builds from an existing OS base.
2. lkt's attack surface will be smaller, because it starts not with an existing OS, but with, well, nothing.
3. lkt images can be significantly smaller, because you add in only precisely what you need.
4. lkt builds run locally. Packer essentially spins up a VM (vbox, EC2, whatever), runs some base image, modifies it per your instructions, and then saves it as a new image. lkt just manipulates OCI images by downloading and copying files to create a new image.
I can get to the same net result for differences 1-3 with Packer and LinuxKit, although lkt is much less work. E.g. I contributed the getty package to LinuxKit to separate and control when/how getty is launched, and in which namespace. Separating and controlling that in a Packer image built on a full OS would have been much harder. Same for the tpm package. Etc.
The biggest difference IMO, though, is difference 4. Because Packer launches a VM and runs commands in it, it is much slower and much harder to debug. The same Packer image that takes me 10+ minutes to build can take 30 seconds in lkt. Your mileage may vary, depending on whether the OCI images are already downloaded and on how complex what you are doing is, but it really has been an order of magnitude faster for me.
Similarly, debugging step by step, or finding an error, running, debugging, and rebuilding, is far harder in a process that runs in a remote VM than it is in a local command: lkt build.
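To make the workflow difference concrete (the file names are placeholders): Packer hands a template to a builder that boots and provisions a VM, while LinuxKit assembles the image locally from OCI images.

# Packer: boots a builder VM, provisions it, then snapshots the result
packer build template.json
# LinuxKit: assembles kernel, init and service containers locally from OCI images
linuxkit build linuxkit.yml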
As I said, opinions are my own, but those are the reasons that I moved almost all of my build work to lkt, contributed, and agreed to join the excellent group of maintainers when asked by the team.
At the same time, I am deeply appreciative to HashiCorp for their fantastic toolset. Packer served me well; nowadays, LinuxKit serves me better.
I have a model on Google Cloud Machine Learning Engine using TensorFlow, and it works fine.
Now I want to run some predictions using the GPU.
I saw this link, but it talks about training with a GPU, not prediction. There's nothing about GPUs in the prediction section.
Does someone know if it's possible to do prediction using Google Machine Learning Engine with a GPU? Or, if I train with a GPU, will my prediction automatically run on a GPU?
I'm using the following command line:
gcloud ml-engine predict --model ${MODEL_NAME} --json-instances request.json
This command works, but it's using the CPU.
Additional information: my model is deployed in the us-east1 region, and scaling is set to automatic.
You cannot choose to use the GPU for prediction in ml-engine. It's unclear whether they are using GPUs by default -- I would link to documentation but there is none available.
I am sure, however, that they are not using TPUs. Currently, Google is only using TPUs for internal services, although they have created a TPU cloud exclusively for researchers to experiment with: https://yourstory.com/2017/05/google-cloud-tpus-machine-learning-models-training/
If you want more control over how your prediction is run, for the same price you can configure a Google Compute Engine instance with a high-powered Tesla K80 GPU. Your TensorFlow model will work there, too, and it is straightforward to set up.
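A minimal sketch of such an instance (the zone, machine type and image are placeholders, and you still have to install the NVIDIA driver and a GPU-enabled TensorFlow on it yourself):

gcloud compute instances create tf-predict-gpu \
--zone us-east1-c \
--machine-type n1-standard-8 \
--accelerator type=nvidia-tesla-k80,count=1 \
--maintenance-policy TERMINATE \
--image-family ubuntu-1604-lts \
--image-project ubuntu-os-cloud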
My suggestion would be to make benchmark predictions on your GCE instance and then compare them to ml-engine. If ml-engine is faster than GCE, then Google is probably using GPUs for prediction. Surely, their goal is to provide GPUs and TPUs within ml-engine in the future, but demand has been overloading the HPC cloud these days.
Online predictions on GCP ML Engine use a single-core CPU by default, which has high latency. If it suits your requirements, you can use the quad-core CPU, which serves predictions faster. To use it, you must specify the machine type for predictions when creating a version of your model on ML Engine, as in the example below. Link to the documentation: https://cloud.google.com/ml-engine/docs/tensorflow/online-predict.
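For example, creating the version with the quad-core machine type could look roughly like this (the model name and bucket path are placeholders; check the documentation linked above for the machine types and runtime versions currently supported):

gcloud beta ml-engine versions create v_quad_core \
--model my_model \
--origin gs://my-bucket/model-dir \
--runtime-version 1.13 \
--machine-type mls1-c4-m2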
We support GPU now. Documentation here!
Example:
gcloud beta ai-platform versions create version_name \
--model model_name \
--origin gs://model-directory-uri \
--runtime-version 2.1 \
--python-version 3.7 \
--framework tensorflow \
--machine-type n1-standard-4 \
--accelerator count=1,type=nvidia-tesla-t4 \
--config config.yaml
If you use one of the Compute Engine (N1) machine types for your model version, you can optionally add GPUs to accelerate each prediction node.
NVIDIA Tesla K80
NVIDIA Tesla P4
NVIDIA Tesla P100
NVIDIA Tesla T4
NVIDIA Tesla V100
This site has some information:
https://cloud.google.com/ml-engine/docs/how-tos/getting-started-training-prediction
However, it is a completely different way to train and predict. They provide the means for training and predicting on their service infrastructure. You just build the model with your TensorFlow program and then use their hardware via their cloud SDK. So it shouldn't matter to you whether it runs on a CPU or a GPU.