Using custom docker containers in Dataflow

From this link I found that Google Cloud Dataflow uses Docker containers for its workers: Image for Google Cloud Dataflow instances
I see it's possible to find out the image name of the docker container.
But is there a way I can get this Docker container (i.e., which repository do I pull it from?), modify it, and then tell my Dataflow job to use the modified container?
The reason I ask is that we need to install various C++, Fortran, and other library code in our Docker images so that the Dataflow jobs can call them. These installations are very time-consuming, so we don't want to use the "resource" property option in Dataflow.

Update for May 2020
Custom containers are only supported within the Beam portability framework.
Pipelines launched within the portability framework currently must pass --experiments=beam_fn_api, either explicitly (as a user-provided flag) or implicitly (for example, all Python streaming pipelines pass it).
See the documentation here: https://cloud.google.com/dataflow/docs/guides/using-custom-containers?hl=en#docker
There will be more Dataflow-specific documentation once custom containers are fully supported by the Dataflow runner. For support of custom containers in other Beam runners, see: http://beam.apache.org/documentation/runtime/environments.
Original answer: The Docker containers used for the Dataflow workers are currently private, and can't be modified or customized.
In fact, they are served from a private Docker repository, so I don't think you're able to pull them onto your machine.

Update Jan 2021: Custom containers are now supported in Dataflow.
https://cloud.google.com/dataflow/docs/guides/using-custom-containers?hl=en#docker
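To make the flags mentioned in this answer concrete, here is a minimal sketch that assembles the options for launching a pipeline with a custom container. The helper name, project, region, and image URI are hypothetical placeholders. --experiments=beam_fn_api is taken from the May 2020 note. To my knowledge, --worker_harness_container_image is the option Beam Python SDKs of that period used to name the container, and newer releases use --sdk_container_image instead, so check the linked guide for the flags that match your SDK version.

```python
def custom_container_flags(project, region, image,
                           experiment="beam_fn_api",
                           image_flag="worker_harness_container_image"):
    """Assemble pipeline flags for a Dataflow job that uses a custom container.

    `experiment` and `image_flag` default to the names that match the May 2020
    note; both are parameters because they changed in later SDK releases
    (an assumption to verify against the linked guide).
    """
    return [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        f"--experiments={experiment}",
        f"--{image_flag}={image}",
    ]


if __name__ == "__main__":
    # All values below are placeholders, not real resources.
    flags = custom_container_flags(
        project="my-project",
        region="us-central1",
        image="gcr.io/my-project/my-beam-image:latest",
    )
    print("\n".join(flags))
```

The resulting list can be appended to the command line of a pipeline script, or passed to apache_beam's PipelineOptions, which accepts a list of flags.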

You can generate a template from your job (see https://cloud.google.com/dataflow/docs/templates/creating-templates for details), then inspect the template file to find the workerHarnessContainerImage it uses.
I just created one for a job using the Python SDK, and the image used there is dataflow.gcr.io/v1beta3/python:2.0.0
Alternatively, you can run a job, then SSH into one of the instances and use docker ps to list the running Docker containers. Use docker inspect [container_id] to see more details, such as the volumes bound to the container.
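Inspecting the template file can be scripted. The sketch below assumes the template is a JSON document and searches it recursively for the workerHarnessContainerImage key, so it does not depend on where exactly the key sits. The sample template is a made-up, heavily trimmed structure used only for illustration; the function names are hypothetical.

```python
import json


def find_key(node, key):
    """Yield every value stored under `key` anywhere in a nested JSON structure."""
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                yield value
            yield from find_key(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


def worker_images(template_text):
    """Return the distinct worker container images named in a template, sorted."""
    found = find_key(json.loads(template_text), "workerHarnessContainerImage")
    return sorted({value for value in found if isinstance(value, str)})


if __name__ == "__main__":
    # Hypothetical, trimmed-down template; a real file has many more fields.
    sample = json.dumps({
        "environment": {
            "workerPools": [
                {"workerHarnessContainerImage": "dataflow.gcr.io/v1beta3/python:2.0.0"}
            ]
        }
    })
    print(worker_images(sample))
```

For a real template, read the file contents (after downloading it from wherever the template was staged) and pass the text to worker_images.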

Related

Run Docker container through Oozie

I'm trying to build an Oozie workflow that runs a Python script every day. The script needs specific libraries to run.
At the moment I have created a Python virtual environment (using venv) on one node of my cluster (which consists of 11 nodes).
Through Oozie I saw that it is possible to run the script using an SSH Action that targets the node containing the virtual environment. Alternatively, it is possible to use a Shell Action to run the Python script, but this requires creating the virtual environment, with the same library dependencies, on the node where the shell is executed (which can be any of the cluster nodes).
I would like to avoid sharing keys or configuring all the cluster nodes to make this possible. Looking in the docs, I found this section about launching applications using Docker containers, but in the Hadoop version of my cluster (Hadoop 3.0.0) this feature is experimental and incomplete. I suppose that if you can launch Docker containers from a shell, you should be able to launch them from Oozie.
So my question is: has anyone tried to do this? Is using Docker this way a hack?
I came across this question, but as of 2019/09/30 it has no specific answers.
UPDATE: I tried to do it, and it works (you can find more info in my answer to this question). I'm still wondering if it's a correct way to do it.
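The update does not show how the container was launched, so the following is only a sketch of one way a Shell Action could do it, not the asker's actual method. It builds a docker run command that mounts a directory of scripts and runs one of them inside an image that already contains the required libraries. The image name, paths, and function names are hypothetical.

```python
import subprocess


def docker_run_command(image, script, host_dir, container_dir="/job"):
    """Build a `docker run` command that executes a Python script inside `image`."""
    return [
        "docker", "run",
        "--rm",                                  # remove the container when it exits
        "-v", f"{host_dir}:{container_dir}:ro",  # mount the scripts read-only
        image,
        "python", f"{container_dir}/{script}",
    ]


def run_in_docker(image, script, host_dir):
    """Run the script; raises CalledProcessError if the container exits non-zero."""
    subprocess.run(docker_run_command(image, script, host_dir), check=True)


if __name__ == "__main__":
    # Placeholder values: print the command instead of running it.
    print(" ".join(docker_run_command("my-registry/etl-env:latest",
                                      "daily_job.py", "/opt/etl")))
```

Whichever node runs the action still needs Docker installed, the scripts present at host_dir, and a user that is allowed to talk to the Docker daemon.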

How can I combine images into one container in Docker (as IIB is dependent on MQ)?

I am new to Docker, and I am trying to configure an IBM integration environment with it. I used the docker pull command to get two different images: one is IBM Integration Bus (IIB) and the other is IBM Message Queueing (MQ). Then I ran each of the images in a separate container using the docker run command.
The problem is that IIB depends on MQ for its broker creation. I created the queue manager in the MQ container, and then created a broker in the IIB container. I need a way to link the broker container with the queue manager container, or to combine them if possible.
https://developer.ibm.com/messaging/learn-mq/mq-tutorials/mq-connect-to-queue-manager/#docker
https://hub.docker.com/r/ibmcom/iib/
Can someone help and provide instructions?

If you can work with the latest IIB version, now called ACE, look at these images.

Does the Docker Remote API support creating a container using a docker-compose file?

I have an app that creates docker containers using the docker remote api, which is done using this library.
So far it is working fine with simple configuration options for container creation. Now I need to create the container with many more config options, so I am wondering whether I can use a docker-compose file. This library is based on v1.23 of the Docker Remote API spec. Does the Docker Remote API support creating a container using a compose file?
I cannot find such an option in this documentation, but I wonder whether I am looking in the wrong place.

No; Docker Compose itself is an application that uses the API. You’d need to directly run docker-compose up or something similar as a shell command if you wanted to directly use it.
(You might be able to hack into its internals if you have a Python program, but not from Java.)
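As a sketch of the "run it as a shell command" approach, the snippet below builds and runs docker-compose up from a program. It is written in Python for brevity; the same idea applies from Java by spawning a process. The function names and file path are hypothetical; -f, -p, and -d are standard docker-compose options.

```python
import subprocess


def compose_up_command(compose_file, project_name=None, detached=True):
    """Build the argument list for `docker-compose up`."""
    cmd = ["docker-compose", "-f", compose_file]
    if project_name:
        cmd += ["-p", project_name]  # isolates container and network names per project
    cmd.append("up")
    if detached:
        cmd.append("-d")             # return once the containers have started
    return cmd


def compose_up(compose_file, project_name=None):
    """Start the services; raises CalledProcessError if docker-compose fails."""
    subprocess.run(compose_up_command(compose_file, project_name), check=True)


if __name__ == "__main__":
    # Print rather than execute, so the sketch runs without Docker installed.
    print(" ".join(compose_up_command("docker-compose.yml", project_name="demo")))
```

The trade-off is that the program depends on the docker-compose binary being installed wherever it runs, instead of talking to the remote API directly.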

CI testing with docker-compose on Jenkins with Kubernetes

I have tests that I run locally using a docker-compose environment.
I would like to implement these tests as part of our CI using Jenkins with Kubernetes on Google Cloud (following this setup).
I have been unsuccessful because docker-in-docker does not work.
It seems that right now there is no solution for this use-case. I have found other questions related to this issue; here, and here.
I am looking for solutions that will let me run docker-compose. I have found solutions for running docker, but not for running docker-compose.
I am hoping someone else has had this use-case and found a solution.
Edit: Let me clarify my use-case:
1. When I detect a valid trigger (i.e., a push to the repo) I need to start a new job.
2. I need to set up an environment with multiple Docker containers/instances (docker-compose).
3. The instances in this environment need access to code from git (mount volumes/create new images with the data).
4. I need to run tests in this environment.
5. I need to then retrieve results from these instances (JUnit test results for Jenkins to parse).
The problems I am having are with 2 and 3.
For 2, there is a problem running this in parallel (more than one job), since the Docker context is shared (docker-in-docker issues). If this runs on more than one node, I get clashes because of shared resources (ports, for example). My workaround is to limit it to one running instance and queue the rest (not ideal for CI).
For 3, there is a problem mounting volumes, since the Docker context is shared (docker-in-docker issues). I cannot mount the code that I check out in the job, because it is not present on the host responsible for running the Docker instances that I trigger. My workaround is to build a new image from my template, copy the code into the new image, and then use that for the test (this works, but it means I need docker cp tricks to get data back out, which is also not ideal).

I think a better way is to use plain Kubernetes resources and run the tests directly in Kubernetes, not through docker-compose.
You can convert your docker-compose files into Kubernetes resources using the kompose utility.
You will probably need to adapt the conversion result, or you may have to convert your docker-compose objects into Kubernetes objects manually. Possibly, you can just use Jobs with multiple containers instead of a combination of Deployments and Services.
In any case, I definitely recommend using Kubernetes abstractions instead of running tools like docker-compose inside Kubernetes.
Moreover, you will still be able to run the tests locally, using Minikube to spawn a small all-in-one cluster right on your PC.
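To illustrate the "Job with multiple containers" idea, here is a minimal sketch of a manual conversion: it turns a simplified docker-compose services mapping into a Kubernetes Job manifest with one container per service. It handles only image, command (as a list), and environment (as a mapping), which is an assumption about the compose file; kompose covers far more. The function name and sample services are hypothetical.

```python
import json


def compose_to_job(name, services):
    """Build a Kubernetes Job manifest (as a dict) with one container per service."""
    containers = []
    for service_name, spec in services.items():
        container = {"name": service_name, "image": spec["image"]}
        if "command" in spec:
            # Compose `command` overrides the image CMD, which Kubernetes calls `args`.
            container["args"] = list(spec["command"])
        if "environment" in spec:
            container["env"] = [
                {"name": key, "value": str(value)}
                for key, value in spec["environment"].items()
            ]
        containers.append(container)
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 0,  # do not retry a failed test run
            "template": {
                "spec": {"restartPolicy": "Never", "containers": containers}
            },
        },
    }


if __name__ == "__main__":
    services = {  # hypothetical test environment
        "tests": {"image": "example/app-tests:latest", "command": ["pytest", "-q"],
                  "environment": {"DB_HOST": "localhost"}},
        "db": {"image": "postgres:13"},
    }
    print(json.dumps(compose_to_job("ci-tests", services), indent=2))
```

The printed JSON can be saved to a file and applied with kubectl. Two caveats: containers in one pod share a network namespace, so services reach each other on localhost rather than by service name, and a Job only completes once every container has exited, so a long-running helper such as a database needs separate handling.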

How to automate application deployment when using LXD containers?

How should applications be scripted/automatically deployed when in LXD containers?
For example, is the best way to deploy applications in LXD containers to use a bash script (which deploys the application)? How can this bash script be executed inside the container by running a command on the host?
Are there any tools/methods of doing this in a similar way to Docker recipes?

In my case, I use Ansible to:
build the LXD containers (web, database, redis for example).
connect to the containers and deploy the services and code needed.
You can also build your own images, for example with the services and/or code already deployed, and build specific containers from these images.
I was doing this from before LXD had Ansible support (Ansible 2.2). I prefer to use SSH instead of the lxd connection when I connect to the containers to deploy services/code. The containers come with a profile in which I have set up my SSH public key (for direct key-based SSH connections, no passwords).

Take a look at my open source project on Bitbucket, devops_lxd_containers. It includes:
Scripts to build lxd image templates including Apache, tomcat, haproxy.
Scripts to demonstrate custom application image builds such as Apache hosting and key/value content and haproxy configured as a router.
Code to launch the containers and map ports so they are accessible to the larger network
Code to configure haproxy as a layer 7 proxy that routes HTTP requests between boxes and containers based on URI prefix, using the record of where each container was previously deployed and which ports were mapped.
At a higher level, it accepts a data-driven spec and deploys an entire environment composed of many containers spread across many hosts, hooking them all up to act as a cohesive whole via a layer 7 proxy.
Extensive documentation showing how I accomplished each major step using code snippets before automating.
Code to support zero-outage upgrades using the layer7 ability to gracefully bleed off old connections while accepting new connections at the new layer.
The entire system is built on the premise that image building is best done in layers. We build an updated Ubuntu image. From it we build a hardened Ubuntu image. From that we build a basic Apache image. From that we build an application-specific image such as our apacheKV sample. The goal is to never build anything more than once and to reuse common functionality, such as the basicJDK image as the source for all JDK-dependent images, so that we avoid duplicating code anywhere. I have striven to keep image or template creation completely separate from deployment and port mapping. The exception is that I could not complete the layer 7 routing image until we knew everything about how the other images would be mapped.
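The layering idea can be sketched as a small dependency walk: given each image's parent, compute an order in which every image is built exactly once and always after the image it derives from. This is an illustration only, not code from the project. The image names follow the description above, and the parent of basicJDK is an assumption.

```python
def build_order(parents):
    """Return image names ordered so each image comes after the image it is built from.

    `parents` maps an image name to its parent image, or None for a base image.
    """
    order, done = [], set()

    def visit(image, trail=()):
        if image in done:
            return
        if image in trail:
            raise ValueError(f"cyclic image dependency involving {image!r}")
        parent = parents.get(image)
        if parent is not None:
            visit(parent, trail + (image,))  # build the parent layer first
        done.add(image)
        order.append(image)

    for image in parents:
        visit(image)
    return order


if __name__ == "__main__":
    layers = {
        "apacheKV": "basic-apache",
        "basic-apache": "ubuntu-hardened",
        "basicJDK": "ubuntu-hardened",  # assumed parent, not stated in the text
        "ubuntu-hardened": "ubuntu-updated",
        "ubuntu-updated": None,
    }
    for image in build_order(layers):
        print(image)
```

Because each image appears once in the result, a build script that walks this list never rebuilds a shared layer such as the hardened Ubuntu image.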

I've been using HashiCorp Packer with the Ansible provisioner, using ansible_connection = lxd.
Some notes here for constructing a template:
When iterating through local files on your host system, you may need to use ansible_connection = local (e.g. for stat & friends).
Using local_action in Ansible with the lxd connection still runs the action inside the container when using stat (but not with include_vars and the lookup function for files).
Using lots of debug messages in Ansible is helpful to know which local environment ansible is actually operating in.

I'm surprised no one here mentioned Canonical's own tool for managing LXD.
https://juju.is
It is super simple and well supported. The only caveat is that it requires you to turn off IPv6 on the LXD/LXC side of things (in the network bridge).
snap install juju --classic
juju bootstrap localhost
From there you can learn about juju models and deploy machines or prebaked images such as Ubuntu:
juju deploy ubuntu
