Privileged / capabilities in a Dataflow container - google-cloud-dataflow

I'm trying to run a Dataflow application such that the container it runs in will be privileged, or at least will have certain capabilities (such as CAP_SYS_PTRACE).
Taking the top_wikipedia_sessions.py as an example, I can run it this way with Apache Beam:
python3 -m apache_beam.examples.complete.top_wikipedia_sessions \
--region us-central1 \
--runner DataflowRunner \
--project my_project \
--temp_location gs://my-cloud-storage-bucket/temp/ \
--output gs://my-cloud-storage-bucket/output/
If I SSH into the created instance, I can see with docker ps and docker inspect that the started container is not privileged and has no added capabilities (nothing in its CapAdd). I couldn't find any way in Apache Beam to control this. I suppose I could SSH into the instances and update their Docker settings, but I wonder if there's a way that doesn't require manually modifying the instances Dataflow starts for me. Perhaps it's a setting I need to update in the GCP cluster settings?

There is no way currently to directly modify the launch of the docker containers on a Dataflow worker.
However, jobs with GPUs enabled do run their containers in privileged mode. This has cost implications but could be a way to experimentally confirm that the feature addresses your need. If you can share more about the specific use case, perhaps it will make sense to generalize this feature to non-GPU jobs.
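To check experimentally whether a worker's containers are privileged, the relevant fields can be read on the worker VM with docker inspect. This is a minimal sketch, assuming SSH access to the instance and a working docker CLI there; the function name show_container_caps is hypothetical.

```shell
# List each running container with its Privileged flag and CapAdd list.
# show_container_caps is a hypothetical helper name; docker ps -q and
# docker inspect --format are standard Docker CLI features.
show_container_caps() {
  for id in $(docker ps -q); do
    docker inspect \
      --format '{{.Name}} privileged={{.HostConfig.Privileged}} cap_add={{.HostConfig.CapAdd}}' \
      "$id"
  done
}
```

Calling show_container_caps on a GPU worker and on a regular worker should make the difference in the Privileged field visible.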

Since your goal is to profile your job with py-spy, have you considered using Dataflow's profiling capabilities?
Docs on profiling a Dataflow Python pipeline
Essay on profiling Dataflow Python
Docs on Cloud Profiler and Python in case you need to dig further
Essay on profiling Dataflow Java could have useful info

Related

Use nohup to run a long process in docker at a remote server

I used to run a long training process on a remote server with GPU capabilities. My work schedule has now changed, so I can't keep my computer connected to a network until the process finishes. I found that nohup is the solution for me, but I don't know how to invoke the process correctly in my situation.
I use ssh to connect to the remote server.
I have to use Docker to access the GPU.
Then I start the process inside the Docker container.
If I start the process with nohup inside Docker, I can't really leave the container, right? So, do I use nohup at each step?
Edit:
I need the terminal output of the process at step 3, because I need that information to carry out the rest of the work. Say step 3 is training a neural network: the training log tells me the accuracy of different models at different iterations, and I use that information to do the testing.
Following @David Maze's suggestion, I did the following (a slightly different approach, as I was not very familiar with Docker):
Logged in to the remote server.
Configured the Dockerfile to set the working directory:
...
WORKDIR /workspace
...
After building the Docker image, I ran the container with a mount option that maps the local project into the container's workdir. I launched docker run under nohup. Since I don't need interactive mode, I omitted the -it flags.
nohup docker run --gpus all -v $(pwd)/path-to-project-root:/workspace/ docker-image:tag bash -c "command1; command2" > project.out 2>&1 &
To test this, I logged out of the server and checked the contents of project.out later. It contained the expected output.
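An alternative that avoids nohup is Docker's detached mode: docker run -d starts the container without tying it to the terminal, and docker logs retrieves its output afterwards. This is a sketch only, reusing the placeholder image, path, and commands from above; the container name training-run and both function names are hypothetical.

```shell
# Start the job detached; the container keeps running after the SSH session ends.
start_training() {
  docker run -d --name training-run --gpus all \
    -v "$(pwd)/path-to-project-root:/workspace/" \
    docker-image:tag bash -c "command1; command2"
}

# Later, from a new SSH session: save the container's stdout and stderr to a file.
save_training_log() {
  docker logs training-run > project.out 2>&1
}
```

With this approach the training log stays available through docker logs for as long as the container exists, so nothing depends on the shell that launched it.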

Running DBT within Airflow through the Docker Operator

Building my question on How to run DBT in airflow without copying our repo, I am currently running airflow and syncing the dags via git. I am considering different options to include DBT within my workflow. One suggestion by louis_guitton is to Dockerize the DBT project, and run it in Airflow via the Docker Operator.
I have no prior experience with the Docker Operator in Airflow, or with DBT in general. I am wondering if anyone has tried this or can share some insights from incorporating that workflow. My main questions are:
Should DBT as a whole project be run as one Docker container, or is it broken down? (for example: are tests run as a separate container from dbt tasks?)
Are logs and the UI from DBT accessible and/or still useful when run via the Docker Operator?
How would partial pipelines be run? (example: wanting to run only a part of the pipeline)
Judging by your questions, you would benefit from trying to dockerise dbt on its own, independently from airflow. A lot of your questions would disappear. But here are my answers anyway.
Should DBT as a whole project be run as one Docker container, or is it broken down? (for example: are tests run as a separate container from dbt tasks?)
I suggest you build one docker image for the entire project. The docker image can be based on the python image since dbt is a python CLI tool. You then use the CMD arguments of the docker image to run any dbt command you would run outside docker.
Please remember the syntax of docker run (which has nothing to do with dbt): you can specify any COMMAND you want to run at invocation time
$ docker run [OPTIONS] IMAGE[:TAG|@DIGEST] [COMMAND] [ARG...]
Also, the first hit on Google for "docker dbt" is this dockerfile that can get you started
Are logs and the UI from DBT accessible and/or still useful when run via the Docker Operator?
Again, it's not a dbt question but rather a docker question or an airflow question.
Can you see the logs in the airflow UI when using a DockerOperator? Yes, see this how-to blog post with screenshots.
Can you access logs from a docker container? Yes, Docker containers emit logs to the stdout and stderr output streams (which you can see in airflow, since airflow picks them up). With the default logging driver, the logs are also stored as JSON files on the host machine under /var/lib/docker/containers/. If you have any advanced needs, you can pick up those logs with a tool (or a simple BashOperator or PythonOperator) and do what you need with them.
How would partial pipelines be run? (example: wanting to run only a part of the pipeline)
See answer 1, you would run your docker dbt image with the command
$ docker run my-dbt-image dbt run -m stg_customers
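The same pattern can be wrapped in a small shell helper so that any dbt subcommand runs inside the image. This is a sketch that reuses the my-dbt-image name from above; the function name dbt_docker and the DBT_IMAGE variable are hypothetical.

```shell
# Image to run; override by exporting DBT_IMAGE before sourcing this file.
DBT_IMAGE="${DBT_IMAGE:-my-dbt-image}"

# Run an arbitrary dbt command inside the project image.
dbt_docker() {
  # --rm removes the container once the dbt command exits.
  docker run --rm "$DBT_IMAGE" dbt "$@"
}

# Examples (not executed here):
#   dbt_docker run -m stg_customers    # build one model
#   dbt_docker test -m stg_customers   # test that model in a separate container
```

Each call starts a fresh container, which is also how a DockerOperator task would invoke the image with a different command per task.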

docker container "post start" activity

I'm new to docker and I'm starting out building, deploying, and maintaining telemetry-like services (grafana, prometheus, ...). One thing I've come across is that I need to start up grafana with some default/preconfigured settings (dashboard, users, org, datasources, ...). Grafana allows some startup configuration in its config file, but not for all of its features (users, org, ...). Outside of docker, I use an ansible script to configure the unsupported parts of grafana. However, when I build my custom grafana image (with the allowed startup config) and later start a grafana container from that image, is there a way to specify "post-start" commands or steps in the Dockerfile? I imagine it to be something like: every time a container of my image is deployed, some steps are issued to configure that container.
Any suggestions? Would I still need to use ansible or other tools like this to manage it?
This is trickier than it sounds. Continuing to use Ansible to configure it post-startup is probably a good compromise between being straightforward, code you already have, and using standard Docker tooling and images.
If this is for a test environment, one possibility is to keep a reference copy of Grafana's config and data directories. You'd have to distribute these separately from the Docker images.
mkdir grafana
docker run \
-v $PWD/grafana/config:/etc/grafana \
-v $PWD/grafana/data:/var/lib/grafana \
... \
grafana/grafana
...
tar cvzf grafana.tar.gz grafana
Once you have the tar file, you can restart the system from a known configuration:
tar xvzf grafana.tar.gz
docker run \
-v $PWD/grafana/config:/etc/grafana \
-v $PWD/grafana/data:/var/lib/grafana \
... \
grafana/grafana
Several of the standard Docker Hub database images have the ability to do first-time configuration, via an entrypoint script; I'll refer to the mysql image's entrypoint script here. The basic technique involves:
Determine whether the command given to start the container is to actually start the server, and if this is the first startup.
Start the server, as a background process, recording its pid.
Wait for the server to become available.
Actually do the first-time initialization.
Stop the server that got launched as a background process.
Go on to exec "$@" as normal to launch the server "for real".
The basic constraint here is that you want the server process to be the only thing running in the container once everything is done. That means commands like docker stop will directly signal the server, and if the server fails, it's the main container process so that will cause the container to exit. Once the entrypoint script has replaced itself with the server as the main container process (by exec'ing it), you can't do any more post-startup work. That leads to the sequence of starting a temporary copy of the server to do initialization work.
Once you've done this initialization work, the relevant content is usually stored in persisted data directories or external databases, so it does not need to be repeated.
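The sequence of steps above can be sketched as an entrypoint script. This is an illustration under stated assumptions, not the mysql image's actual script: the names myserver, server_is_ready, run_init_steps, DATA_DIR, and the .initialized marker file are all hypothetical placeholders.

```shell
#!/bin/sh
# Sketch of a first-time-initialization entrypoint (hypothetical names).

DATA_DIR="${DATA_DIR:-/var/lib/myserver}"
MARKER="$DATA_DIR/.initialized"

# Placeholder readiness check: it only verifies that the process is alive.
# A real script would probe the server's port or a health endpoint.
server_is_ready() { kill -0 "$1" 2>/dev/null; }

# Placeholder for the real setup work (users, orgs, datasources, ...).
run_init_steps() { :; }

first_time_init() {
  "$@" &                             # 2. start a temporary server in the background
  pid=$!
  tries=0
  until server_is_ready "$pid"; do   # 3. wait for it to become available
    tries=$((tries + 1))
    [ "$tries" -ge 30 ] && return 1  # give up after about 30 seconds
    sleep 1
  done
  run_init_steps                     # 4. do the first-time initialization
  kill "$pid"                        # 5. stop the temporary server
  wait "$pid" 2>/dev/null || true
  touch "$MARKER"                    # record that initialization has been done
}

# 1. initialize only when asked to start the server and no marker exists yet
if [ "${1:-}" = "myserver" ] && [ ! -e "$MARKER" ]; then
  first_time_init "$@" || exit 1
fi

# 6. replace this shell with the requested command as the main container process
exec "$@"
```

Because the marker lives in the persisted data directory, a restarted container skips the temporary server and goes straight to the final exec.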
SO questions have a common shortcut of starting a server process in the background, and then using something like tail -f /dev/null as the actual main container process. This means that docker stop will signal the tail process, but not tell the server that it's about to shut down; it also means that if the server does fail, since the tail process is still running, the container won't exit. I'd discourage this shortcut.

What are the Docker RUN params for mimicking IronWorker memory constraints?

In the past I've run into trouble when hosting my workers in a cloud infrastructure because of memory constraints that weren't faithfully reproduced when testing the code locally on my overpowered machine.
IronWorker is one such cloud provider that limits workers in its multi-tenant infrastructure to 380 MB. Luckily, with their switch to Docker, I can hope to catch problems early on by asking my local Docker container to use artificial memory limits when testing.
But I'm not sure as to which parameters from the following: https://docs.docker.com/engine/reference/run/ are the right ones to use when setting a 380mb limit ... any advice?
Does the logic from https://goldmann.pl/blog/2014/09/11/resource-management-in-docker/#_example_managing_the_memory_shares_of_a_container still apply?
You'll want to use --memory, for example, based on the node README:
docker run --memory 380M --rm -e "PAYLOAD_FILE=hello.payload.json" -v "$PWD":/worker -w /worker iron/node node hello.js
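To confirm that the limit was applied, the container's configured memory can be read back with docker inspect, which reports it in bytes. This is a sketch; both helper names are hypothetical, and the container name is whatever was passed to --name when starting it.

```shell
# Convert mebibytes to bytes; Docker interprets "380M" as 380 * 1024 * 1024.
mib_to_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}

# Succeeds if the named container's memory limit equals the given MiB value.
has_memory_limit() {
  actual=$(docker inspect --format '{{.HostConfig.Memory}}' "$1")
  [ "$actual" = "$(mib_to_bytes "$2")" ]
}

# Example (not executed here), for a container started with --name worker:
#   has_memory_limit worker 380 && echo "limit applied"
```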

How do I run Docker on Google Compute Engine?

What's the procedure for installing and running Docker on Google Compute Engine?
Until the recent GA release of Compute Engine, running Docker was not supported on GCE (due to kernel restrictions), but with the newly announced ability to deploy and use custom kernels, that restriction no longer applies and Docker now works great on GCE.
Thanks to proppy, the instructions for running Docker on Google Compute Engine are now documented for you here: http://docs.docker.io/en/master/installation/google/. Enjoy!
They now have a VM image with docker pre-installed.
$ gcloud compute instances create instance-name \
--image projects/google-containers/global/images/container-vm-v20140522 \
--zone us-central1-a \
--machine-type f1-micro
https://developers.google.com/compute/docs/containers/container_vms
A little late, but I wanted to add an answer with a more detailed workflow and links, since answers are still rather scattered:
Create a Docker image
a. Locally
b. Using Google Container Builder
Push local Docker image to Google Container Repository
docker tag <current name>:<current tag> gcr.io/<project name>/<new name>
gcloud docker -- push gcr.io/<project name>/<new name>
UPDATE
If you have upgraded to Docker client versions above 18.03, gcloud docker commands are no longer supported. Instead of the above push, use:
docker push gcr.io/<project name>/<new name>
If you have issues after upgrading, see more here.
Create a compute instance.
This process actually hides a number of steps. It creates a virtual machine (VM) instance using Google Compute Engine, which uses a Google-provided, container-optimized OS image. The image includes Docker and additional software responsible for starting our docker container. Our container image is then pulled from the Container Repository, and run using docker run when the VM starts. Note: you still need to use docker attach even though the container is running. It's worth pointing out that only one container can be run per VM instance. Use Kubernetes to deploy multiple containers per VM (the steps are similar). Find more details on all the options in the links at the bottom of this post.
gcloud beta compute instances create-with-container <desired instance name> \
--zone <google zone> \
--container-stdin \
--container-tty \
--container-image <google repository path>:<tag> \
--container-command <command (in quotes)> \
--service-account <e-mail>
Tip: you can view available gcloud projects with gcloud projects list
SSH into the compute instance.
gcloud beta compute ssh <instance name> \
--zone <zone>
Stop or delete the instance. If an instance is stopped, you will still be billed for resources such as static IPs and persistent disks. To avoid being billed at all, delete the instance.
a. Stop
gcloud compute instances stop <instance name>
b. Delete
gcloud compute instances delete <instance name>
Related Links:
More on deploying containers on VMs
More on zones
More create-with-container options
As of now, for just Docker, the Container-optimized OS is certainly the way to go:
gcloud compute images list --project=cos-cloud --no-standard-images
It comes with Docker and Kubernetes preinstalled. The only thing it lacks is the Cloud SDK command-line tools. (It also lacks python3, despite Google's announcement of the Python 2 sunset on 2020-01-01. Well, it's still 27 days to go...)
As an additional piece of information, I was searching for a standard image that would offer both docker and gcloud/gsutil preinstalled (and found none, oops). I do not think I'm alone in this boat, as gcloud is something you can hardly get by without on GCE¹.
My best find so far was the Ubuntu 18.04 image that came with their own (non-Debian) package manager, snap. The image comes with the Cloud SDK preinstalled, and Docker installs literally in a snap: 11 seconds in an initial test on an F1 instance, about 6s on an n1-standard-1. The only snag I hit was the error message that the docker authorization helper was not available; an attempt to add it with gcloud components install failed because the SDK was installed as a snap, too. However, the helper is actually there, only not in the PATH. The following is what got me both tools available in a single transient builder VM with the least setup script runtime, starting from the supported Ubuntu 18.04 LTS image²:
snap install docker
ln -s /snap/google-cloud-sdk/current/bin/docker-credential-gcloud /usr/bin
gcloud -q auth configure-docker
¹ I needed both for a Daisy workflow imaging a disk with both artifacts from GS buckets and a couple of huge, 2GB+ library images from the local gcr.io registry that were shared between the build (as cloud builder layers) and the runtime (where I had to create and extract containers to the newly built image). But that's beside the point; one may need both tools for a multitude of possible reasons.
² Use gcloud compute images list --uri | grep ubuntu-1804 to get the most current one.
Google's GitHub site now offers a GCE image including docker. https://github.com/GoogleCloudPlatform/cloud-sdk-docker-image
It's as easy as:
creating a Compute Engine instance
curl https://get.docker.io | bash
Using docker-machine is another way to host your google compute instance with docker.
docker-machine create \
--driver google \
--google-project $PROJECT \
--google-zone asia-east1-c \
--google-machine-type f1-micro $YOUR_INSTANCE
If you want to log in to this machine on Google Compute Engine, just use docker-machine ssh $YOUR_INSTANCE
Refer to docker machine driver gce
There is now improved support for containers on GCE:
Google Compute Engine is extending its support for Docker containers. This release is an Open Preview of a container-optimized OS image that includes Docker and an open source agent to manage containers. Below, you'll find links to interact with the community interested in Docker on Google, open source repositories, and examples to get started. We look forward to hearing your feedback and seeing what you build.
Note that this is currently (as of 27 May 2014) in Open Preview:
This is an Open Preview release of containers on Virtual Machines. As a result, we may make backward-incompatible changes and it is not covered by any SLA or deprecation policy. Customers should take this into account when using this Open Preview release.
Running Docker on a GCE instance is not supported. The instance goes down and you are not able to log in again.
We can use the Docker image provided by GCE to create an instance.
If your Google Cloud virtual machine is based on Ubuntu, use the following command to install docker
sudo apt install docker.io
You may use this link: https://cloud.google.com/cloud-build/docs/quickstart-docker#top_of_page.
The said link explains how to use Cloud Build to build a Docker image and push the image to Container Registry. You will first build the image using a Dockerfile and then build the same image using the Cloud Build's build configuration file.
It's better to set it up while creating the compute instance
Go to the VM instances page.
Click the Create instance button to create a new instance.
Under the Container section, check Deploy container image.
Specify a container image name under Container image and configure options to run the container if desired. For example, you can specify gcr.io/cloud-marketplace/google/nginx1:1.12 for the container image.
Click Create.
Installing Docker on GCP Compute Engine VMs:
This is the link to GCP documentation on the topic:
https://cloud.google.com/compute/docs/containers#installing
It links to the Docker install guide; follow the instructions for the Linux distribution you have running in the VM.

Resources