Google Colab Pro+ Timeout

I am using Google Colab to train YOLOv3 on a custom dataset. I have about 200,000 iterations to run to finish the training. I subscribed to Colab Pro+; initially the runtime lasted 24 hours, but now it is cut off after less than 12 hours. I am not really sure why the runtime is being interrupted/stopped.
I know there was no error while training YOLOv3 on the custom dataset, so the runtime is being stopped by a timeout rather than by a crash.

Related

Multiple colab sessions working in the same google sheet

I'm using Google Colab and Google Sheets to automate some tasks. Would the process be faster if I split the script into parts that run in multiple Colab sessions (e.g. session 1 handles steps 1-100 while session 2 handles steps 101-200 in parallel), or can the Sheets interface only handle one request at a time?
Tested this for myself. Results were:
1 session did 1.15 iterations per minute
3 sessions did 1.70 (0.48 + 0.65 + 0.57) iterations per minute
I have noticed that performance depends on limits on Google's side that vary, so I cannot know for sure that these numbers hold until I've done more tests. The first test looks promising, though.
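If it helps, here is a minimal sketch of how the split could look, assuming each session authenticates with gspread and is given its own row range; the sheet key, worksheet name, row bounds, credentials file, and process_row function are all hypothetical placeholders.
import gspread

# Per-session settings (hypothetical): e.g. session 1 takes rows 2-101, session 2 takes rows 102-201.
SHEET_KEY = "your-sheet-key"
ROW_START, ROW_END = 2, 101

def process_row(values):
    # Placeholder for whatever each step actually does with a row.
    return ",".join(values).upper()

gc = gspread.service_account(filename="service-account.json")  # assumes a service account is set up
ws = gc.open_by_key(SHEET_KEY).worksheet("Sheet1")

for row in range(ROW_START, ROW_END + 1):
    values = ws.row_values(row)                                  # one read request per row
    ws.update_cell(row, len(values) + 1, process_row(values))    # one write request per row
Note that Sheets API read/write quotas are enforced per project and per user, so sessions sharing the same credentials also share the quota, which may be why three sessions did not give three times the throughput.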

GCP Dataflow - Throughput gradually slows down, Workers underutilized

I have a Beam pipeline running on GCP Dataflow. The pipeline performs the following steps:
Read a number of PGP-encrypted files (total size over 100 GB, individual files about 2 GB each)
Decrypt the files to form a PCollection
Do a wait() on the PCollection
Do some processing on each record in the PCollection before writing it to an output file
Behavior seen with GCP Dataflow:
When reading and decrypting the input files, the job starts with one worker and then scales up to 30 workers, but only one worker stays busy; utilization on all the other workers is below 10%.
Initially, throughput was 150K records per second during decryption, so 90% of the decryption completes within an hour, which is good. But then the throughput gradually slows down, eventually to just 100 records per second, so the remaining 10% of the workload takes another 1-2 hours.
Any idea why the workers are underutilized? And if they are not being utilized, why are they not scaled down? I am paying unnecessarily for a large number of VMs :-(. Second, why does the throughput drop towards the end, significantly increasing the time to completion?
There is a known issue related to the throughput and input behavior of Cloud Dataflow. I suggest you track the improvements being made to the autoscaling and worker-utilization behavior here.
The default architecture for Dataflow worker processing and autoscaling is not as responsive in some cases as it is when the Dataflow Streaming Engine feature is enabled. I would recommend trying the pipeline with Streaming Engine enabled, since it provides more responsive autoscaling based on the CPU utilization of your pipeline.
I hope you find the above information useful.
Can you try to implement your solution without wait()?
For example:
FileIO.match().filepattern() -> ParDo(DoFn to decrypt files) -> FileIO.readMatches() -> ParDo(DoFn to read files)
See the example here.
This should allow your pipeline to parallelize better.
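For what it's worth, here is a rough sketch of a similar un-wait()-ed pipeline using the Beam Python SDK; it is not an exact translation of the Java shape above (decryption is folded into the DoFn that reads each matched file), and the bucket paths and DoFn bodies are placeholders.
import apache_beam as beam
from apache_beam.io import fileio

class DecryptAndParseFn(beam.DoFn):
    # Placeholder DoFn: decrypt one matched file and emit its records.
    def process(self, readable_file):
        raw = readable_file.read()                                # bytes of one encrypted file
        for line in raw.decode("utf-8", "ignore").splitlines():   # real PGP decryption would go here
            yield line

class ProcessRecordFn(beam.DoFn):
    # Placeholder DoFn: per-record processing before writing the output.
    def process(self, record):
        yield record

with beam.Pipeline() as p:
    (
        p
        | "MatchFiles" >> fileio.MatchFiles("gs://my-bucket/input/*.pgp")
        | "ReadMatches" >> fileio.ReadMatches()
        | "DecryptAndParse" >> beam.ParDo(DecryptAndParseFn())
        | "ProcessRecords" >> beam.ParDo(ProcessRecordFn())
        | "WriteOutput" >> beam.io.WriteToText("gs://my-bucket/output/part")
    )
Without the wait() barrier, records can flow to the downstream ParDo as soon as each file is decrypted, which gives the runner more freedom to spread work across workers.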

Google Cloud Composer vCPU time Confusion

I've been trying Composer recently to run my pipeline, and it has cost surprisingly more than I expected. Here is what I got from the bill:
Cloud Composer Cloud Composer vCPU time in South Carolina: 148.749 hours
[Currency conversion: USD to AUD using rate 1.475] A$17.11
Cloud Composer Cloud Composer SQL vCPU time in South Carolina: 148.749 hours
[Currency conversion: USD to AUD using rate 1.475] A$27.43
I only used Composer for two or three days, and definitely not 24 hours per day, so I don't know where the 148 hours come from.
Does that mean that after you deploy a DAG to Composer, the environment keeps using resources and accumulating vCPU time even when nothing is running?
How can I reduce the cost if I want to use Composer to run my pipeline every day? Thanks.
Cloud Composer primarily charges for the compute resources allocated to an environment, because most of its components keep running even when no DAGs are deployed. This is because Airflow is primarily a workflow scheduler, so there is not much you can turn off and still expect to be there the moment a workflow is ready to run.
In your case, the billed vCPU time comes from your environment's GKE nodes and your managed Airflow database. Aside from the GKE node count, there is not much you can reduce or turn off, so if you need anything smaller, you may want to consider self-managed Airflow or another platform entirely. The same applies if your primary objective is simply processing data and you don't need the scheduling that Airflow offers.
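As a purely illustrative sanity check (the vCPU count and duration below are hypothetical, not taken from your environment), vCPU time of that size follows naturally from an always-on environment:
# Composer bills vCPU time for as long as the environment exists, not just while DAGs run.
allocated_vcpus = 2                      # hypothetical vCPUs allocated to the environment
hours_environment_existed = 24 * 3       # e.g. three full days, even with no DAGs running
print(allocated_vcpus * hours_environment_existed)   # 144 vCPU-hours, the same order as the billed 148.749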
As far as I am aware, this is not a built-in feature of Composer yet.
At the worker level, you should be able to do this by manually modifying the Composer environment's configuration so that its Kubernetes workers scale up and down with the workload.
Joshua Hendinata wrote a guide covering the steps needed to enable autoscaling in Composer [1].
This article, which introduces ways to save on Composer costs, may also be of interest [2].
Hope this helps you out!
[1] https://medium.com/traveloka-engineering/enabling-autoscaling-in-google-cloud-composer-ac84d3ddd60
[2] https://medium.com/condenastengineering/automating-a-cloud-composer-development-environment-590cb0f4d880

Watching over SageMaker while it is training

I am using Amazon SageMaker to train a model with a lot of data.
This takes a lot of time - hours or even days. During this time, I would like to be able to query the trainer and see its current status, in particular:
How many iterations has it already done, and how many does it still need to do? (The training algorithm is deep learning, so it is iteration-based.)
How much time does it need to complete the training?
Ideally, I would like to classify a test sample with the model from the current iteration, to see its current performance.
One way to do this is to explicitly tell the trainer to print debug messages after each iteration. However, these messages are only available in the console from which I launched the trainer. Since training takes so long, I would like to be able to query the trainer's status remotely, from different computers.
Is there a way to remotely query the status of a running trainer?
All logs are available in Amazon CloudWatch. You can query CloudWatch programmatically via its API to parse the logs.
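For example, here is a minimal boto3 sketch that prints recent log events for a training job; the job name is a placeholder, and /aws/sagemaker/TrainingJobs is the log group SageMaker training jobs write to.
import boto3

logs = boto3.client("logs")

# Each training job writes one or more log streams prefixed with its job name.
streams = logs.describe_log_streams(
    logGroupName="/aws/sagemaker/TrainingJobs",
    logStreamNamePrefix="your-training-job-name",    # placeholder
)
for stream in streams["logStreams"]:
    events = logs.get_log_events(
        logGroupName="/aws/sagemaker/TrainingJobs",
        logStreamName=stream["logStreamName"],
        limit=20,                                    # most recent events in the stream
    )
    for event in events["events"]:
        print(event["message"])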
Are you using a built-in algorithm or a framework like MXNet or TensorFlow? For TensorFlow you can monitor your job with TensorBoard.
Additionally, you can see the high-level job status using the describe training job API call:
import sagemaker
sm_client = sagemaker.Session().sagemaker_client
# Returns the job status, secondary status transitions, timing information, and more.
print(sm_client.describe_training_job(TrainingJobName='Your job name here'))

Autoscaling in Google Cloud Dataflow is not working as expected

I am trying to enable autoscaling in my Dataflow job as described in this article. I did that by setting the relevant algorithm via the following code:
DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
options.setAutoscalingAlgorithm(AutoscalingAlgorithmType.THROUGHPUT_BASED);
After I set this and deployed my job, it always runs with the maximum number of workers available, i.e. if I set the maximum number of workers to 10, it uses all 10 even though average CPU usage is about 50%. How does this THROUGHPUT_BASED algorithm work, and where am I making a mistake?
Thanks.
Although autoscaling tries to reduce both the backlog and CPU usage, backlog reduction takes priority. The specific backlog value matters: Dataflow calculates 'backlog in seconds' roughly as backlog / throughput and tries to keep it below 10 seconds.
In your case, I think what is preventing downscaling from 10 workers is the policy around the persistent disks (PDs) used for pipeline execution. When max workers is 10, Dataflow uses 10 persistent disks and tries to keep the number of workers at any time such that these disks are distributed roughly equally. As a consequence, when the pipeline is at its maximum of 10 workers, it tries to downscale to 5 (two disks per worker) rather than to 7 or 8, where the disks would not divide evenly. In addition, it tries to keep the projected CPU utilization after downscaling at no more than 80%.
These two factors together can effectively prevent downscaling in your case: with CPU utilization at 50% on 10 workers, the projected utilization on 5 workers is 100%, so Dataflow does not downscale because that is above the 80% target.
Google Dataflow is working on a new execution engine that does not depend on persistent disks and does not suffer from this limit on how far it can downscale.
A workaround is to set a higher max_workers; your pipeline might still stay at 10 workers or below, but that incurs a small extra cost for the additional PDs.
Another, more remote, possibility is that even after upscaling the estimated 'backlog seconds' might not stay below 10 seconds despite sufficient CPU. This can be due to various factors (user code processing, Pub/Sub batching, etc.). I would like to hear whether that is affecting your pipeline.
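If it helps, here is what those settings look like in the Beam Python SDK, with max_num_workers raised per the workaround above; the project, region, and bucket values are placeholders.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                      # placeholder
    region="us-central1",                      # placeholder
    temp_location="gs://my-bucket/tmp",        # placeholder
    autoscaling_algorithm="THROUGHPUT_BASED",  # same algorithm as the Java snippet in the question
    max_num_workers=20,                        # higher cap, so the disk-distribution rule leaves more room to downscale
)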