Google cloud jobs submit training gets stuck - machine-learning

Hello, I had set up Google Cloud Machine Learning to train a neural network, but suddenly I am unable to submit jobs to Google Cloud.
There is no error; the command just hangs without doing anything. My instance is also running. Here is the command:
gcloud ml-engine jobs submit training job9123 --runtime-version 1.0 --job-dir gs://dataset1_giorgaros2 --package-path trainmodule --module-name trainmodule.nncloud --region europe-west1 --config cloudml-gpu.yaml -- --train-file gs://dataset1_giorgaros2/nnn.p
Thank You !

The ML Engine job logs can help you obtain more details about the failed job execution; in most cases the log contains the cause of the failure.
Finding the job logs on ML Engine
If you are re-submitting the exact same command for each training job, you may be getting an error related to the job name, because the name must be unique for each job on ML Engine, as described in the naming convention rules for ML Engine jobs.
ML Engine naming conventions
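For example, a minimal sketch (reusing the job name from the question; the timestamp suffix is just one way of keeping names unique):
JOB_NAME="job9123_$(date +%Y%m%d_%H%M%S)"   # a fresh, unique name for each new submission
gcloud ml-engine jobs describe job9123       # state and error message of an already-created job
gcloud ml-engine jobs stream-logs job9123    # stream that job's logs from Cloud Logging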

Try checking network connectivity to Google Compute Engine.
Check the logs from the run in the Cloud Console: https://console.cloud.google.com/
And of course, read the docs:
https://cloud.google.com/sdk/gcloud/reference/ml-engine/jobs/submit/training
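If the submit command simply hangs with no output, two standard gcloud options can help narrow things down; a sketch, assuming the same flags as in the question:
gcloud info --run-diagnostics            # checks network reachability of the Google Cloud APIs
gcloud ml-engine jobs submit training job9123 --verbosity=debug [remaining flags as in the question]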

Related

Unable to submit PySpark job while context lives in Jupyter Lab

I have created a Spark standalone cluster on Docker which can be found here.
The issue that I'm facing is that when I run the first cell in JupyterLab to create a SparkContext I lose the ability to submit jobs (Python programs). I keep getting the message:
TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
I'm not sure where the issue is, but it seems like the driver is blocked.
I'm not sure how to formulate the question more precisely, since I can submit PySpark jobs normally whenever the Jupyter application has not been submitted.
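One thing worth checking: on a standalone cluster the first application takes all available cores by default, so a SparkContext left running in the notebook can starve every later submission. Capping the notebook application's resources is one way to test that; a sketch with placeholder values:
pyspark --master spark://<master-host>:7077 --total-executor-cores 2 --executor-memory 1g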

Dataflow Pipeline Follows Notebook Execution Number. Can't Update Pipeline

I am trying to update my dataflow pipeline. I like developing using Jupyter notebooks on Google Cloud. However, I've run into this error when trying to update:
"The new job is missing steps [5]: read/Read."
I understand the reason: I re-ran some cells in my notebook and added new ones, so instead of "[5]: read/Read" the step is now "[23]: read/Read". But surely Dataflow doesn't need to care about the Jupyter notebook execution count. Is there a way to turn this off and name the steps using only the given names, without the numbers?
The Notebooks documentation recommends restarting the kernel and rerunning all cells to avoid that behavior.
"(Optional) Before using your notebook to run Dataflow jobs, restart the kernel, rerun all cells, and verify the output. If you skip this step, hidden states in the notebook might affect the job graph in the pipeline object."
Doing so keeps the execution numbers consistent between runs, so the step names in the new job graph match the ones in the job you are updating.
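If rerunning everything is not practical, the Dataflow update flow also accepts a transform name mapping that translates old step names to new ones; a rough sketch for the names in the question (the script name is a placeholder, and the exact invocation depends on how you launch the pipeline):
python my_pipeline.py --runner DataflowRunner --update --transform_name_mapping='{"[5]: read/Read": "[23]: read/Read"}'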

Problem running Dask on AWS Sagemaker and AWS Fargate

I am trying to set up a cluster on AWS to run distributed sklearn model training with Dask. To get started, I was trying to follow this tutorial, which I hope to tweak: https://towardsdatascience.com/serverless-distributed-data-pre-processing-using-dask-amazon-ecs-and-python-part-1-a6108c728cc4
I have managed to push the docker container to AWS ECR and then launch a CloudFormation template to build a cluster on AWS Fargate. The next step in the tutorial is to launch an AWS Sagemaker Notebook. I have tried this but something is not working because when I run the commands I get errors (see image). What might the problem be? Could it be related to the VPC/subnets? Is it related to AWS Sagemaker internet access? (I have tried enabling and disabling this).
Expected Results: dask to update, scaling up of the Fargate cluster to work.
Actual Results: none of the above.
In my case, when running through the same tutorial, DaskSchedulerService took too long to complete: the creation was initiated but never finished in CloudFormation.
After 5-6 hours I got the following:
DaskSchedulerService CREATE_FAILED Dask-Scheduler did not stabilize.
The workers did not run, and, consequently, it was not possible to connect to the Client.
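When a Fargate service "did not stabilize", the underlying ECS service events usually explain why (failed task placement, image pull errors, and so on). A sketch with the AWS CLI, assuming the cluster and service names created by the tutorial's template (adjust to whatever your stack actually created):
aws ecs describe-services --cluster Fargate-Dask-Cluster --services Dask-Scheduler --query 'services[0].events[0:5]'
aws ecs list-tasks --cluster Fargate-Dask-Cluster --desired-status STOPPED   # stopped tasks to inspect for a failure reason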

Google Bigtable export hangs, is stuck, then fails in Dataflow. Workers never allocated

I'm trying to use this process:
https://cloud.google.com/bigtable/docs/exporting-sequence-files
to export my Bigtable table for backup. I've tried bigtable-beam-import versions 1.1.2 and 1.3.0 with no success. The program seems to kick off a Dataflow job properly, but no matter what settings I use, workers never seem to get allocated to the job. The logs always say:
Autoscaling: Raised the number of workers to 0 based on the rate of progress in the currently running step(s).
Then it hangs and workers never get allocated. If I let it run, the logs say:
2018-03-26 (18:15:03) Workflow failed. Causes: The Dataflow appears to be stuck. Workflow failed. Causes: The Dataflow appears to be stuck. You can get help with Cloud Dataflow at https://cloud.google.com/dataflow/support.
then it gets cancelled:
Cancel request is committed for workflow job...
I think I've tried changing all the possible pipeline options described here:
https://cloud.google.com/dataflow/pipelines/specifying-exec-params
I've tried turning Autoscaling off and specifying the number of workers like this:
java -jar bigtable-beam-import-1.3.0-shaded.jar export \
--runner=DataflowRunner \
--project=mshn-preprod \
--bigtableInstanceId=[something] \
--bigtableTableId=[something] \
--destinationPath=gs://[something] \
--tempLocation=gs://[something] \
--maxNumWorkers=10 \
--zone=us-central1-c \
--bigtableMaxVersions=1 \
--numWorkers=10 \
--autoscalingAlgorithm=NONE \
--stagingLocation=gs://[something] \
--workerMachineType=n1-standard-4
I also tried specifying the worker machine type. Nothing changes: it always autoscales to 0 and fails. If there are people from the Dataflow team reading, you can check out the failed job ID: exportjob-danleng-0327001448-2d391b80.
Anyone else experience this?
After testing lots of changes to my GCloud project permissions, checking my quotas, etc., it turned out that my issue was with networking. This Stack Overflow question/answer was really helpful:
Dataflow appears to be stuck
It turns out that our team had created some networks/subnets in the gcloud project and removed the default network. When dataflow was trying to create VMs for the workers to run, it failed because it was unable to do so in the "default" network.
There was no error in the Dataflow logs, just the one above about Dataflow being stuck. We ended up finding a helpful error message in the "Activity" stream on the Google Cloud console home page. We then solved the problem by creating a VPC literally called "default", with subnets called "default" in all the regions. Dataflow was then able to allocate VMs properly.
You should be able to pass the network and subnet as pipeline parameters. That didn't work for us with the BigTable export script provided (link in the question), but if you're writing Java code directly against the Dataflow API, you can probably fix the issue I had by setting the right network and subnet from your code.
Hope this helps anyone who is dealing with the symptoms we saw.
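If you hit the same symptom, recreating a network literally named default is one way to apply the fix described above; in auto mode gcloud creates one subnet per region, each named after the network (a sketch with standard commands):
gcloud compute networks create default --subnet-mode=auto   # one "default" subnet per region
gcloud compute networks list                                # confirm the network exists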

Google cloud platform setup ERROR: (gcloud.beta.ml) Invalid choice: 'init-project'

I am using Cloud Shell in Google Cloud Platform. I am trying to get things installed for machine learning. The commands that I have used so far are:
curl https://storage.googleapis.com/cloud-ml/scripts/setup_cloud_shell.sh | bash
export PATH=${HOME}/.local/bin:${PATH}
curl https://storage.googleapis.com/cloud-ml/scripts/check_environment.py | python
gcloud beta ml init-project
The first three lines work fine, but for the last command I get
ERROR: (gcloud.beta.ml) Invalid choice: 'init-project'.
Usage: gcloud beta ml [optional flags] <group>
group may be language | speech | video | vision
For detailed information on this command and its flags, run:
gcloud beta ml --help
Does anyone know what I can do to solve this problem?
Thank you.
First off, let me note that you don't need to run the BETA command as the gcloud ml variant is also available.
As the error message indicates, 'init-project' is not a valid choice; you should instead use one of the following groups: language, speech, video, vision, each of which allows you to make calls to the corresponding API. For instance, you could run the following:
$ gcloud ml vision detect-faces IMAGE_PATH
and detect faces within the indicated image.
That said, from your comments it appears that you are not interested in any of the above. If you are looking to train your own TensorFlow models on Google Cloud Platform, you should take a look at the docs relating to Cloud ML Engine. The page that dsesto pointed you to is a good start. I would advise that you also try out the examples in this GitHub repository, particularly the census one. Once there, you'll also see that the gcloud command group used for training models on the cloud (as well as deploying them and using them for prediction jobs) is actually gcloud ml-engine, not gcloud ml.
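For reference, a sketch of what running the census sample locally looks like with ml-engine (the module name and user flags follow the sample's layout, and the variables are placeholders; your own trainer will differ):
gcloud ml-engine local train \
  --module-name trainer.task \
  --package-path trainer/ \
  -- \
  --train-files $TRAIN_DATA --eval-files $EVAL_DATA --job-dir output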
