I am trying to set up a cluster on AWS to run distributed sklearn model training with Dask. To get started, I was following this tutorial, which I hope to tweak: https://towardsdatascience.com/serverless-distributed-data-pre-processing-using-dask-amazon-ecs-and-python-part-1-a6108c728cc4
I have managed to push the Docker container to AWS ECR and then launch a CloudFormation template to build a cluster on AWS Fargate. The next step in the tutorial is to launch an AWS SageMaker notebook. I have tried this, but something is not working: when I run the commands I get errors (see image). What might the problem be? Could it be related to the VPC/subnets? Is it related to SageMaker internet access? (I have tried both enabling and disabling this.)
Expected results: Dask updates successfully and scaling up the Fargate cluster works.
Actual results: neither works.
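For reference, the notebook commands I'm running are roughly along these lines; the scheduler address and the ECS cluster/service names below are placeholders for whatever the CloudFormation template created in my account:

import boto3
from dask.distributed import Client

# After upgrading dask/distributed in the notebook, connect to the Dask
# scheduler task running on Fargate (port 8786). The address is a placeholder
# for the service-discovery name the template created.
client = Client("Dask-Scheduler.local-dask:8786")

# Scale the worker service up through ECS, then restart the Dask workers so
# the scheduler picks them up. Cluster and service names are placeholders too.
ecs = boto3.client("ecs")
ecs.update_service(
    cluster="Fargate-Dask-Cluster",
    service="Dask-Workers",
    desiredCount=4,
)
client.restart()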
In my case, when running through the same tutorial, DaskSchedulerService took too long to complete: the creation was initiated but never finished in CloudFormation.
After 5-6 hours I got the following:
DaskSchedulerService CREATE_FAILED Dask-Scheduler did not stabilize.
The workers did not run, and consequently it was not possible to connect a Client to the scheduler.
Related
What is the best-practice workflow for developing and testing an image (locally, I guess) that is going to be deployed into a K8s cluster whose hardware differs from my laptop's?
To give a bit of context: I'm running deep learning code that needs GPUs and my laptop doesn't have any, so I launch a "training job" into the K8s cluster (K8s is probably not meant to be used this way, but that's how we use it where I work), and I'm not sure how I should be developing and testing my Docker images.
At the moment I create a container with the desired GPU and manually run commands until the code works. Once it runs, I copy the commands that made it work out of the shell history into a local Dockerfile on my computer, build the image, and push it to Docker Hub. The next time I launch a training job into the cluster, the image is pulled from there, a container is created from it, and the model is trained.
The problem with this approach is that if there's a bug in the image, I only find out after it has been deployed to the cluster, and then I have to start the whole process over to fix the Dockerfile. Also, finding bugs in the output of kubectl logs is very cumbersome.
Is there a better way to do this?
I was thinking of installing Docker inside the container and using IntelliJ (or any other IDE) to attach to it via SSH so I can develop and test the image remotely, but I have read in many places that this is not a good idea.
What would you recommend then instead?
Many thanks!!
Setting up a Rails app on AWS Fargate has been somewhat of a struggle, but the more I tried, the more I learned. I now have multiple tasks running the different parts of my environment (web server, worker, and task queue). The last piece of the puzzle is getting rails console access to this environment.
I've read articles on Medium such as https://engineering.loyaltylion.com/running-an-interactive-console-on-amazon-ecs-c692f321b14d, but they seem to depend on EC2 rather than Fargate.
Then I found this post on SO: How to launch a rails console in a Fargate container
It seems that the solution is to set up a VPN into my VPC. Since I'm not an expert on networking, I was wondering if there is a clear guide on how to set up such a VPN from a Mac?
And if I do succeed in setting up this VPN, how would I then run rails c? Is there some AWS CLI command I need to run? Do I need to define a separate task that runs the command... or?
To run a Rails console in a container running on Fargate, you would need something like docker exec, but that is not supported yet. There is an open issue for it on the AWS containers roadmap: https://github.com/aws/containers-roadmap/issues/187
I've read the Cloud Composer overview (https://cloud.google.com/composer/) and documentation (https://cloud.google.com/composer/docs/).
It doesn't seem to mention failover.
I'm guessing it does, since it runs on a Kubernetes cluster. Does it?
By failover I mean: if the Airflow webserver or scheduler stops for some reason, does it get started again automatically?
Yes, since Cloud Composer is built on Google Kubernetes Engine, it benefits from all the fault tolerance of any other service running on Kubernetes Engine. Pod and machine failures are automatically healed.
Hi Stack Overflow community, I have a question about using Docker with AWS EC2. I am comfortable with EC2 but very new to Docker. I code in Python 3.6 and would like to automate the following process:
1: Start an EC2 instance with Docker (the Docker image is stored in ECR)
2: Run a one-off process and return the results (call them "T") in CSV format
3: Store "T" in AWS S3
4: Shut down the EC2 instance
The reason for using an EC2 instance is that the process is quite computationally intensive and not feasible on my local computer. The reason for Docker is to ensure the development environment is the same across the team and the CI system (currently CircleCI). I understand that interactions with AWS can mostly be done using Boto3.
I have been reading about AWS's own ECS, and I have the feeling that it's geared more towards deploying a web app with Docker than towards running a one-off process. However, when I search around for EC2 + Docker, nothing but ECS comes up. I have also done the AWS tutorial, but it didn't help much.
I have also considered bootstrapping EC2 with a shell script (i.e. installing Docker, pulling the image, running the container, etc.), which I sketch below, but that feels a bit hacky. Therefore my questions are:
1: Is ECS really the most appropriate solution in this scenario? (Or, in other words, is ECS designed for such operations?)
2: If so, are there any examples of people setting up and running a one-off process using ECS? (I find the setup really confusing, especially the terminology used.)
3: What are the other alternatives (if any)?
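For concreteness, the "EC2 with a shell script" route I mentioned would look roughly like this with Boto3; the AMI ID, image URI, bucket name, and instance profile are all placeholders:

import boto3

# User data that pulls the image from ECR, runs the one-off job, copies the
# CSV it produces to S3, and shuts the instance down. It assumes an AMI with
# Docker and the AWS CLI preinstalled; all names and IDs are placeholders.
USER_DATA = """#!/bin/bash
set -e
aws ecr get-login-password --region eu-west-1 | docker login --username AWS --password-stdin 123456789012.dkr.ecr.eu-west-1.amazonaws.com
docker pull 123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-job:latest
mkdir -p /tmp/out
docker run --rm -v /tmp/out:/out 123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-job:latest
aws s3 cp /tmp/out/T.csv s3://my-results-bucket/T.csv
shutdown -h now
"""

ec2 = boto3.client("ec2")
ec2.run_instances(
    ImageId="ami-xxxxxxxx",                         # AMI with Docker preinstalled
    InstanceType="c5.4xlarge",                      # sized for the heavy computation
    MinCount=1,
    MaxCount=1,
    UserData=USER_DATA,
    IamInstanceProfile={"Name": "my-job-profile"},  # needs ECR pull + S3 write permissions
    InstanceInitiatedShutdownBehavior="terminate",  # so shutdown -h terminates the instance
)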
Thank you so much for the help!
Without knowing more about your process, I'd like to pose two alternatives for you.
Use Lambda
Depending on just how compute-intensive your process is, this may not be a viable option. However, if it is something that can be distributed, Lambda is awesome. You can find more information about the resource limitations here. With this route, you would simply write Python 3.6 code to perform your task and write "T" to S3.
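For example, the whole job could collapse into something like the following handler; the bucket and key names are placeholders, and this assumes the computation and its output fit within Lambda's memory and runtime limits:

import csv
import io

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # Run the one-off computation here; 'rows' stands in for the real result "T".
    rows = [["col_a", "col_b"], [1, 2]]

    # Serialize to CSV in memory and write it to S3 (bucket/key are placeholders).
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    s3.put_object(Bucket="my-results-bucket", Key="T.csv", Body=buf.getvalue())

    return {"status": "done", "key": "T.csv"}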
Use Data Pipeline
With Data Pipeline, you can build a custom AMI (EC2) and use that as your image. You can then specify the size of the EC2 resource that you need to run this process. It sounds like your process would be pretty simple. You would need to define:
Ec2Resource
Specify the AMI, Role, Security Group, Instance Type, etc.
ShellCommandActivity
Bootstrap the EC2 instance as needed
Grab your code from S3, GitHub, etc.
Execute your code (include writing "T" to S3 in your code)
You can also schedule the pipeline to run at an interval/schedule or call it directly from boto3.
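Once the pipeline has been defined, kicking it off from boto3 takes only a couple of calls; the pipeline id below is a placeholder:

import boto3

dp = boto3.client("datapipeline")

# Activate an existing pipeline by id (placeholder id shown).
dp.activate_pipeline(pipelineId="df-0123456789ABC")

# Optionally check on it afterwards.
desc = dp.describe_pipelines(pipelineIds=["df-0123456789ABC"])
print(desc["pipelineDescriptionList"][0]["name"])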
New question:
I've followed the guestbook tutorial here: https://github.com/GoogleCloudPlatform/kubernetes/blob/master/examples/guestbook/README.md
And the output of my commands matches theirs exactly. When I try to access the guestbook web server, though, the page does not load.
Specifically, I have the frontend on port 80, I have enabled HTTP/HTTPS connections in the console for all instances, and I have run the command:
gcloud compute firewall-rules create TagNameHere-<PortNumberHere> --allow=tcp:<PortNumberHere> --target-tags=TagNameHere
and also
cluster/kubectl.sh get services guestbook -o template --template='{{(index .status.loadBalancer.ingress 0).ip}}'
But when I run curl -v http://<external-ip>:<port> (using the IP returned by the command above), the connection simply times out.
What am I missing?
Old Question - Ignore:
Edit: Specifically, I have 3 separate Docker images. How can I tell Kubernetes to run these three images?
I have 3 Docker images that rely on each other to perform their tasks: one is InfluxDB, another is a web app, and the third is an engine that does the data processing.
I have managed to get them working locally on my machine with docker-compose, and now I want to deploy them on Google Compute Engine so that I can access the app over the web. I also want to be able to scale the software. I am completely, 100% new to cloud computing and have never used GCE before.
I have looked at Kubernetes and followed the docs, but I cannot get it to work on a GCE instance. What am I missing or not understanding? I have searched and read all the docs I could find, but I still don't feel any closer to getting it working than before.
https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/getting-started-guides/gce.md
To get the best results on SO, you need to ask specific questions.
But to answer a general question with a general answer: Google Cloud Platform's Kubernetes wrapper is Container Engine. I suggest you run through the Container Engine tutorials, paying careful attention to the configuration files, before you attempt to implement your own solution.
See the guestbook to get started: https://cloud.google.com/container-engine/docs/tutorials/guestbook
To echo what rdc said, you should definitely go through the tutorial, which will help you understand the system better. But the short answer to your question is that you want to create a ReplicationController and specify the containers' information in the pod template.
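If it helps, here is a rough sketch of that idea using the Kubernetes Python client rather than a YAML manifest; all names and image references are placeholders for your three images:

from kubernetes import client, config

config.load_kube_config()  # uses your local kubeconfig

# A ReplicationController whose pod template lists the three containers.
# Names and images below are placeholders.
rc = client.V1ReplicationController(
    metadata=client.V1ObjectMeta(name="myapp"),
    spec=client.V1ReplicationControllerSpec(
        replicas=1,
        selector={"app": "myapp"},
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "myapp"}),
            spec=client.V1PodSpec(containers=[
                client.V1Container(name="influxdb", image="influxdb:latest"),
                client.V1Container(name="webapp", image="myrepo/webapp:latest"),
                client.V1Container(name="engine", image="myrepo/engine:latest"),
            ]),
        ),
    ),
)

client.CoreV1Api().create_namespaced_replication_controller(namespace="default", body=rc)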