gcloud ml-engine predict is very slow on inference - google-cloud-ml-engine

I'm testing a segmentation model on gcloud and the inference is incredibly slow. It takes 3 min to get a result (averaged over 5 runs). The same model runs in ~2.5 s on my laptop when served through tf-serving.
Is this normal? I didn't find any mention in the documentation of how to choose the instance type, and it seems impossible to run inference on a GPU.
The steps I'm using are fairly straightforward and follow the examples and tutorials:
gcloud ml-engine models create "seg_model"
gcloud ml-engine versions create v1 \
--model "seg_model" \
--origin $DEPLOYMENT_SOURCE \
--runtime-version 1.2 \
--staging-bucket gs://$BUCKET_NAME
gcloud ml-engine predict --model ${MODEL_NAME} --version v1 --json-instances request.json
Update: after running more experiments I found that redirecting the output to a file gets the inference time down to 27 s. The model output is 512x512, which probably causes some delay on the client side. Although 27 s is much better than 3 min, it is still an order of magnitude slower than tf-serving.
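One way to separate server-side latency from terminal printing is to call the online prediction REST API directly from Python and time just the request. This is a minimal sketch, assuming the google-api-python-client package, application-default credentials, and a placeholder project id; it reuses the same request.json (one JSON instance per line) that gcloud ml-engine predict consumes:

import json
import time

from googleapiclient import discovery

PROJECT = 'my-project'   # placeholder: substitute your GCP project id
MODEL = 'seg_model'
VERSION = 'v1'

# Reuse the request file passed to `gcloud ml-engine predict`
# (--json-instances expects one JSON instance per line).
with open('request.json') as f:
    instances = [json.loads(line) for line in f if line.strip()]

service = discovery.build('ml', 'v1')
name = 'projects/{}/models/{}/versions/{}'.format(PROJECT, MODEL, VERSION)

start = time.time()
response = service.projects().predict(
    name=name, body={'instances': instances}).execute()
elapsed = time.time() - start

# Write the 512x512 output to disk instead of printing it, so the timing
# reflects the prediction round trip rather than terminal rendering.
with open('response.json', 'w') as f:
    json.dump(response, f)
print('online prediction round trip: %.2f s' % elapsed)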

Related

Yolov5 Training keep running on local system

I recently bought a GPU (RTX 3060 Ti); before that I used to work on Google Colab (free version). I downloaded yolov5 to my local machine, set up the environment variable for it, and installed the required dependency libraries. To test my GPU, I ran training for 3 epochs with the same dataset I use on Colab, where it takes only around 30 seconds to complete (on a Tesla T4, which has around 2000 fewer CUDA cores than the RTX 3060 Ti). On my GPU, however, the run kept going for around 3 hours without finishing, so I interrupted it.
[Screenshot of YOLOv5 running in VS Code]
The code I ran on my local machine is:
# !git clone https://github.com/ultralytics/yolov5 # clone
# %cd yolov5
%pip install -qr requirements.txt # install
import torch
import utils
display = utils.notebook_init() # checks
# Train YOLOv5s on COCO128 for 3 epochs
!python train.py --img 412 --batch 16 --epochs 3 --data train_data/data.yaml --weights yolov5s.pt
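One quick check worth running before train.py (not part of the original post) is whether the local PyTorch install actually sees the RTX 3060 Ti; a CPU-only build or a missing CUDA setup is a common reason for a local run being orders of magnitude slower than a Colab T4. A minimal diagnostic sketch:

import torch

# If CUDA is unavailable, YOLOv5 silently falls back to the CPU and a
# 3-epoch COCO128 run can take hours instead of seconds.
print('torch version :', torch.__version__)
print('CUDA available:', torch.cuda.is_available())
if torch.cuda.is_available():
    print('device        :', torch.cuda.get_device_name(0))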

Google Dataflow creates only one worker for large .bz2 file

I am trying to process the Wikidata JSON dump using Cloud Dataflow.
I downloaded the file from https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2 and uploaded it to a GCS bucket. It's a large (50 GB) .bz2 file containing a list of JSON dicts (one per line).
I understand that apache_beam.io.ReadFromText can handle .bz2 (I tested that on toy datasets) and that .bz2 is splittable. I was therefore hoping that multiple workers would be created and would work in parallel on different blocks of that single file (though I'm not totally clear whether/how blocks would be resolved).
Ultimately I want to do some analytics on each line (each JSON dict), but as a test of ingestion I am just using the project's wordcount.py:
python -m apache_beam.examples.wordcount \
--input gs://MYBUCKET/wikidata/latest-all.json.bz2 \
--output gs://MYBUCKET/wikidata/output/entities-all.json \
--runner DataflowRunner \
--project MYPROJECT \
--temp_location gs://MYBUCKET/tmp/
At startup, autoscaling quickly increases the number of workers from 1 to 6, but only one worker does any work, and autoscaling scales back from 6 to 1 after a couple of minutes (job id: 2018-10-11_00_45_54-9419516948329946918).
If I disable autoscaling and set the number of workers explicitly, all but one remain idle.
Can parallelism be achieved on this sort of input? Many thanks for any help.
Unlike Hadoop, Apache Beam has not yet implemented bzip2 splitting: https://issues.apache.org/jira/browse/BEAM-683
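Given that the bzip2 read cannot be split, one common workaround is to follow the read with a Reshuffle so that, even though a single worker decompresses the file, the downstream per-line work can fan out across workers. This is a sketch under that assumption rather than the wordcount example from the question; the bucket paths are the placeholders used above, and the trailing-comma handling reflects how the Wikidata dump formats its lines:

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def parse_line(line):
    # Wikidata dump lines are JSON dicts followed by a trailing comma;
    # the first and last lines are the enclosing '[' and ']'.
    line = line.strip().rstrip(',')
    if line in ('[', ']', ''):
        return []
    return [json.loads(line)]


def run(argv=None):
    options = PipelineOptions(argv)
    with beam.Pipeline(options=options) as p:
        (p
         | 'ReadBz2' >> beam.io.ReadFromText(
             'gs://MYBUCKET/wikidata/latest-all.json.bz2')
         # The bzip2 read stays on one worker, but Reshuffle breaks fusion
         # so the parsing/analytics below can run on many workers.
         | 'Reshuffle' >> beam.Reshuffle()
         | 'ParseJson' >> beam.FlatMap(parse_line)
         | 'ExtractId' >> beam.Map(lambda entity: entity.get('id', ''))
         | 'Write' >> beam.io.WriteToText(
             'gs://MYBUCKET/wikidata/output/entities'))


if __name__ == '__main__':
    run()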

Error while running model training in google cloud ml

I want to run model training in the cloud. I am following this link, which runs sample code to train a model on the flower dataset. The tutorial consists of 4 stages:
Set up your Cloud Storage bucket
Preprocessing training and evaluation data in the cloud
Run model training in the cloud
Deploying and using the model for prediction
I was able to complete steps 1 and 2; however, in step 3 the job is submitted successfully but an error occurs and the task exits with a non-zero status of 1. Here is the log of the task:
[Screenshot of the expanded log]
I used the following command:
gcloud ml-engine jobs submit training test${JOB_ID} \
--stream-logs \
--module-name trainer.task \
--package-path trainer \
--staging-bucket ${BUCKET_NAME} \
--region us-central1 \
--runtime-version=1.2 \
-- \
--output_path "${GCS_PATH}/training" \
--eval_data_paths "${GCS_PATH}/preproc/eval*" \
--train_data_paths "${GCS_PATH}/preproc/train*"
Thanks in advance!
Can you please confirm that the input files (eval_data_paths and train_data_paths) are not empty? Additionally, if you are still having issues, please file an issue at https://github.com/GoogleCloudPlatform/cloudml-samples since it's easier to handle the issue on GitHub.
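To check that suggestion without rerunning the job, a small sketch like the following can verify that the preprocessing output actually matches the globs passed to the trainer (GCS_PATH is the same placeholder as in the command above; tf.gfile is the TF 1.x API matching runtime version 1.2):

import tensorflow as tf

GCS_PATH = 'gs://my-bucket/my-job'   # placeholder: use the tutorial's GCS_PATH

for pattern in (GCS_PATH + '/preproc/eval*', GCS_PATH + '/preproc/train*'):
    files = tf.gfile.Glob(pattern)
    print('%s -> %d files' % (pattern, len(files)))
    if not files:
        print('  WARNING: no matching files; preprocessing output may be missing or empty')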
I ran into the same issue and couldn't figure it out; then I followed this, started again from git clone, and there was no error after running on GCS.
It is clear from your error message
The replica worker 1 exited with a non-zero status of 1. Termination reason: Error
that you have some programming error (a syntax error, an undefined name, etc.).
For more information, check the return codes and their meanings:
Return code | Meaning               | Cloud ML Engine response
0           | Successful completion | Shuts down and releases job resources.
1-128       | Unrecoverable error   | Ends the job and logs the error.
You need to find your bug first and fix it, then try again.
I recommend running your task locally (if your configuration supports it) before submitting it to the cloud. If you find any bug, you can fix it easily on your local machine.

How does one write a ClusterSpec for distributed YouTube-8M challenge training?

Can someone please post a ClusterSpec for distributed training of the models defined in the YouTube-8m Challenge code?
The code tries to load a cluster spec from TF_CONFIG environment variable. However, I'm not sure what the value for TF_CONFIG should be. I have access to 2 GPUs on one machine and just want to run the model with data-level parallelism.
If you want to run the YouTube-8M challenge code in a distributed manner, you have to write a YAML file (there is an example YAML file provided by Google) and then pass the location of that YAML file as a parameter. TF_CONFIG refers to the configuration variables used to train the model.
For example, to run the starter code on Google Cloud in a distributed manner, I used:
JOB_NAME=yt8m_train_$(date +%Y%m%d_%H%M%S); gcloud --verbosity=debug ml-engine jobs \
submit training $JOB_NAME \
--package-path=youtube-8m --module-name=youtube-8m.train \
--staging-bucket=$BUCKET_NAME --region=us-east1 \
--config=youtube-8m/cloudml-gpu-distributed.yaml \
-- --train_data_pattern='gs://youtube8m-ml-us-east1/1/frame_level/train/train*.tfrecord' \
--frame_features=True --model=LstmModel --feature_names="rgb,audio" \
--feature_sizes="1024, 128" --batch_size=128 \
--train_dir=$BUCKET_NAME/${JOB_TO_EVAL}
The --config parameter points to the YAML file cloudml-gpu-distributed.yaml, which has the following specification:
trainingInput:
  runtimeVersion: "1.0"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 2
  workerType: standard_gpu
  parameterServerCount: 2
  parameterServerType: standard
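For reference on the original TF_CONFIG question: when a job runs with a scale tier like the one above, Cloud ML Engine sets TF_CONFIG to a JSON object with cluster and task keys, and the training code parses it into a tf.train.ClusterSpec. The following is a minimal local illustration of that layout (the addresses and the two-worker split are made up, and this is not the challenge code itself):

import json
import os

import tensorflow as tf  # TF 1.x, matching the challenge code

# Hand-built TF_CONFIG with the same shape Cloud ML Engine generates from
# the yaml scale tier; the host:port values here are placeholders.
os.environ['TF_CONFIG'] = json.dumps({
    'cluster': {
        'master': ['localhost:2222'],
        'worker': ['localhost:2223', 'localhost:2224'],
        'ps': ['localhost:2225'],
    },
    'task': {'type': 'master', 'index': 0},
})

env = json.loads(os.environ.get('TF_CONFIG', '{}'))
cluster = tf.train.ClusterSpec(env.get('cluster', {}))
task = env.get('task', {'type': 'master', 'index': 0})

# Each replica starts a server for its own job/task; the training graph is
# then placed across the cluster (e.g. via tf.train.replica_device_setter).
server = tf.train.Server(cluster, job_name=task['type'], task_index=task['index'])
print(server.target)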

Vowpal Wabbit doesn't save the model despite -f flag is present

I encountered the following unexplainable behaviour in Vowpal Wabbit: sometimes it simply doesn't save the model when the -f flag is specified, without raising any exception.
The command is composed automatically by a script and has the following form (file names are changed):
vw -d ./data/train_set -p ./predictions \
-f ./model --cache --passes 3 \
--ftrl_alpha 0.106920149657 --ignore T -l 0.83184072971 \
-b 29 --loss_function logistic --ftrl_beta 0.97391780827 \
--ftrl -q SE -q SZ -q DR
Then it trains normally and the standard diagnostic information is displayed, but the model is not saved!
The weirdest thing is that everything works fine with other parameter configurations!
Context: I'm working on hyperparameter optimization, and my script successively composes vw training and validation commands. It always gets to the 5th iteration and always fails on the 6th (on exactly the same command). Any help will be appreciated.
That was a bug in the Vowpal Wabbit source code. It has since been fixed and models are saved as expected. Here is the issue on GitHub:
https://github.com/JohnLangford/vowpal_wabbit/issues/859