I am working on a migration from an on-premise system to Cloud Composer. Cloud Composer is a fully managed version of Airflow, which restricts access to the underlying file system. On my on-premise system I have a lot of environment variables that hold paths, saved like /opt/application/folder_1/subfolder_2/....
The Cloud Composer documentation says that you can read and save your data in the data folder, which is mapped to /home/airflow/gcs/data/. If I go with that mapping, I would have to change my env variable values to something like /home/airflow/gcs/data/application/folder_1/folder_2, which could be a bit painful, since I'm running many bash scripts that rely on those values.
Is there any approach to solve this problem?
You can specify your env variables during Composer creation/update process [1]. These vars are then stored in the YAML files that create the GKE cluster where Composer is hosted. If you SSH into a VM running the Composer GKE cluster, then enter one of the worker containers and run env, you can see the env variables you specified.
[1] https://cloud.google.com/composer/docs/how-to/managing/environment-variables
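One way to limit the number of values that have to change is to keep a single root variable and derive every other path from it. Below is a minimal sketch of that idea, not something taken from the Composer docs: the variable name APP_DATA_ROOT and the helper app_path are made up for illustration, and the default is the on-premise root from the question.

```python
import os
from pathlib import PurePosixPath

# Assumption: APP_DATA_ROOT is a hypothetical variable name. On-premise it is
# left unset; on Composer it would be set to a folder under /home/airflow/gcs/data/.
DEFAULT_ROOT = "/opt/application"


def app_path(*parts, env=None):
    """Join path parts under the root taken from APP_DATA_ROOT."""
    env = os.environ if env is None else env
    root = env.get("APP_DATA_ROOT", DEFAULT_ROOT)
    return str(PurePosixPath(root).joinpath(*parts))
```

With this layout only one value differs between the two systems, and bash scripts can follow the same pattern by building their paths from the same root variable.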
I can't find much information on what the differences are in running Airflow on Google Cloud Composer vs Docker. I am trying to switch our data pipelines that are currently on Google Cloud Composer onto Docker to just run locally but am trying to conceptualize what the difference is.
Cloud Composer is a GCP managed service for Airflow. Composer runs in something known as a Composer environment, which runs on a Google Kubernetes Engine cluster. It also makes use of various other GCP services such as:
Cloud SQL - stores the metadata associated with Airflow,
App Engine Flex - Airflow web server runs as an App Engine Flex application, which is protected using an Identity-Aware Proxy,
GCS bucket - in order to submit a pipeline to be scheduled and run on Composer, all that we need to do is to copy our Python code into a GCS bucket. Within that bucket there is a folder called DAGs. Any Python code uploaded into that folder is automatically picked up and processed by Composer.
How does Cloud Composer benefit you?
Focus on your workflows, and let Composer manage the infrastructure (creating the workers, setting up the web server, the message brokers),
One-click to create a new Airflow environment,
Easy and controlled access to the Airflow Web UI,
Provide logging and monitoring metrics, and alert when your workflow is not running,
Integrate with all of Google Cloud services: Big Data, Machine Learning and so on. Run jobs elsewhere, e.g. on another cloud provider (Amazon).
Of course you have to pay for the hosting service, but the cost is low compared to hosting a production Airflow server on your own.
Airflow on-premise
DevOps work that needs to be done: create a new server, manage the Airflow installation, take care of dependency and package management, check server health, and handle scaling and security.
pulling an Airflow image from a registry and creating the container,
creating a volume that maps the directory on local machine where DAGs are held, and the locations where Airflow reads them on the container,
whenever you want to submit a DAG that needs to access GCP service, you need to take care of setting up credentials. Application's service account should be created and downloaded as a JSON file that contains the credentials. This JSON file must be linked into your docker container and the GOOGLE_APPLICATION_CREDENTIALS environment variable must contain the path to the JSON file inside the container.
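As a rough illustration of the last point, here is a small check you could run inside the container to confirm that the variable points at a readable key file. This is only a sketch: the function name is made up, and it assumes the standard service account key layout with "type" and "client_email" fields. It returns only non-secret fields.

```python
import json
import os


def describe_gcp_credentials(env=None):
    """Return non-secret fields of the key file named by GOOGLE_APPLICATION_CREDENTIALS."""
    env = os.environ if env is None else env
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    if not os.path.isfile(path):
        raise FileNotFoundError(f"credentials file not found in container: {path}")
    with open(path, encoding="utf-8") as handle:
        info = json.load(handle)
    # Deliberately leave out the private key so it never reaches logs.
    return {"type": info.get("type"), "client_email": info.get("client_email")}
```

If this raises inside the container, the volume mapping or the variable value is most likely wrong, which is easier to spot here than in a failing DAG.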
To sum up, if you don’t want to deal with all of those DevOps problems, and instead just want to focus on your workflow, then Google Cloud Composer is a great solution for you.
Additionally, I would like to share with you tutorials that set up Airflow with Docker and on GCP Cloud Composer.
I have created a simple PHP api application that works with a mysql database to store data. I have been experimenting with Kubernetes on my Windows 10 machine through Minikube.
I have just about got my head round the ideas involved, yet I’m not sure about how to implement this properly. So far I have used Kompose to create a set of yaml files from an existing docker-compose file. This has been half successful.
To get my application code into a pod hosting PHP, I have been using hostPath to share from my local machine. I mount to the minikube machine and share from there. I was having trouble sharing by other means. The application code is hosted in a github repo.
My questions are:
Is mounting my application code into a pod (assuming this is similar to what happens in docker) the correct way to do this? I’m not clear exactly what information is held on an image retrieved from the docker hub. Although I have read up on containers isolating the build environment from your machine.
How does this approach translate into a production environment hosted on a cloud? I see there are various storage types. I had, for example, wanted to try deploying on AWS just to see how this would work in practice.
I’m really looking for guidance to go from the tutorials found on the web working on my machine, to something that could be done for a customer hosted on the cloud. This might scale up to a more microservices style architecture over time.
The approach you are describing is mostly for development setups, where you want to mount your code into the container as a volume so you don't have to rebuild every time your code changes. Typically done with a docker-compose file.
For production setups, you want the docker image to work correctly on its own and only mount volumes for data you want to persist; databases are the typical example. For this, EKS is deeply integrated into the AWS infrastructure and will create EBS volumes on demand. You don't need to provision any volume or even care in most cases (unless you need multiple read-write volumes for scaling).
For a PHP application you really should not persist any data in the pod, because it will create other issues when you need to scale the application. Also, a good approach for managing files that need to persist is S3 (AWS simple storage service).
So generally speaking, you need a deployment per application, a service to access the pods of that application, and then an ingress object to route traffic from the internet to those pods.
Your application docker image is really the core. You just build it with your code inside. Make sure to pass configuration using environment variable or configuration file so you can connect to the database.
Now for kubernetes, for each component (e.g. PHP application, MySQL) you will most likely create a deployment k8s manifest that points to the docker image and adds some configuration environment variables.
For production, you will need persistent volumes. On AWS you can simply use EBS-backed volumes.
To get traffic from Internet to your PHP application, you will need to add one or more k8s components:
K8s Service manifest that exposes your PHP deployment/pod on a stable address. If you only have one or very few services, you can use type LoadBalancer, which on a cloud like AWS will create an ALB/ELB (you might need to add annotations to your service).
An ingress, which is just a reverse proxy (contour, nginx, traefik). In a cloud environment it will map to an ALB/ELB. The advantage of this is that you can have a single ALB for all your services, i.e. save money. Also, you can configure routing paths or TLS termination in one place.
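To make the deployment plus service pairing a bit more concrete, here is a minimal sketch that builds both manifests as plain dictionaries and prints them as JSON, which kubectl accepts as well as YAML. The image name, the port and the DB_HOST variable are placeholders I made up, not values from the question.

```python
import json


def deployment(name, image, env, port=80, replicas=1):
    """Build a Deployment manifest whose pods carry the label app=<name>."""
    labels = {"app": name}
    container = {
        "name": name,
        "image": image,
        "env": [{"name": key, "value": value} for key, value in sorted(env.items())],
        "ports": [{"containerPort": port}],
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {"metadata": {"labels": labels}, "spec": {"containers": [container]}},
        },
    }


def service(name, port=80, service_type="ClusterIP"):
    """Build a Service that selects the pods created by deployment(name, ...)."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "type": service_type,
            "selector": {"app": name},
            "ports": [{"port": port, "targetPort": port}],
        },
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    items = [deployment("php-api", "example/php-api:1.0", {"DB_HOST": "mysql"}), service("php-api")]
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2))
```

The part worth noticing is that the service selector and the pod template labels have to match, otherwise the service routes to nothing. An ingress would then point at the service name.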
With Docker, there is discussion (consensus?) that passing secrets through runtime environment variables is not a good idea because they remain available as a system variable and because they are exposed with docker inspect.
In kubernetes, there is a system for handling secrets, but then you are left to either pass the secrets as env vars (using envFrom) or mount them as a file accessible in the file system.
Are there any reasons that mounting secrets as a file would be preferred to passing them as env vars?
I got all warm and fuzzy thinking things were so much more secure now that I was handling my secrets with k8s. But then I realized that in the end the 'secrets' are treated just the same as if I had passed them with docker run -e when launching the container myself.
Environment variables aren't treated very securely by the OS or applications. Forking a process shares its full environment with the forked process. Logs and traces often include environment variables. And the environment is visible to the entire application as effectively a global variable.
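The inheritance point is easy to see for yourself. The sketch below sets a variable in the current process and starts a child process without passing it anything explicitly; the variable name is made up for the demonstration.

```python
import os
import subprocess
import sys


def child_sees(name, value):
    """Set a variable in this process and return what a child process reads for it."""
    os.environ[name] = value
    try:
        # No env= argument: the child inherits the parent's environment by default.
        child_code = "import os, sys; sys.stdout.write(os.environ.get(sys.argv[1], ''))"
        result = subprocess.run(
            [sys.executable, "-c", child_code, name],
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout
    finally:
        os.environ.pop(name, None)
```

Calling child_sees("DEMO_SECRET", "s3cr3t") hands the value back from the child, even though the child was never given it on purpose.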
A file can be read directly into the application by the routine that needs it and handled as a local variable that is not shared with other methods or forked processes. With swarm mode secrets, these secret files are injected into a tmpfs filesystem on the workers that is never written to disk.
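For comparison, reading a mounted secret could look like the sketch below. The helper name is made up. With swarm mode the file would sit under /run/secrets/, while with Kubernetes the path is whatever mountPath you chose for the secret volume.

```python
from pathlib import Path


def read_secret(path):
    """Read a mounted secret file and return its content without the trailing newline."""
    return Path(path).read_text(encoding="utf-8").rstrip("\n")
```

The value then only exists in the local variable of whichever routine called read_secret, so it is not handed to child processes and does not show up in a dump of the environment.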
Secrets injected as environment variables into the configuration of the container are also visible to anyone who has access to inspect the containers. Quite often those variables are committed into version control, making them even more visible. Separating the secret into its own object that is flagged for privacy allows you to manage it differently from open configuration like environment variables.
Yes, since when you mount the secret, the actual value is not visible through docker inspect or other Pod management tools. Moreover, you can enforce file-level access at the file system level of the host for those files.
Further suggested reading: Kubernetes Secrets.
Secrets in Kubernetes are used to store sensitive information like passwords and SSL certificates.
You definitely want to mount SSL certs as files in the container rather than sourcing them from environment variables.
I would like to be able to test my docker application locally before sending it to the cluster. I want to use minikube for this. Meanwhile, instead of having multiple kube config files which would define env variables for the cloud environment and for my local machine, I would like to override some of the env variables when running locally. I can see that you can do something like that with docker compose:
docker-compose -f docker-compose.yml -f docker-compose.e2e.yml up
The second file would only have the overriding values. Yes, there are two files but I find it clean.
Is there a way to do something similar with Kube/minikube? Or even something better?
I think you are asking how to pass different environment values into your Pods depending upon which environment they are deployed to. One pattern to achieve this is to deploy with helm. Then you use templated versions of your kubernetes descriptors for deployment. You also have a values.yaml file that contains values to be injected into the descriptors. You can switch and overlay values.yaml files at the time of install to control which values are injected for a given installation.
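Roughly speaking, the overlay works like a recursive dictionary merge where later files win. The sketch below only illustrates that behaviour and is not Helm's actual implementation. The keys and values are invented for the example.

```python
def overlay(base, override):
    """Return base with override laid on top: nested dicts merge, other values are replaced."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = overlay(merged[key], value)
        else:
            merged[key] = value
    return merged


# Invented example values: a shared base file and a small local override file.
base_values = {"image": {"tag": "1.0"}, "env": {"DB_HOST": "cloud-db", "LOG_LEVEL": "info"}}
local_values = {"env": {"DB_HOST": "localhost"}}
local_install = overlay(base_values, local_values)
```

Here local_install keeps the image tag and LOG_LEVEL from the base and only swaps DB_HOST. That is the same idea as the second docker-compose file that holds only the overriding values. With helm you would typically pass both files with repeated -f flags at install time, the later file taking precedence.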
If you are asking how to switch whether a kubectl command runs against local or cloud without having to keep switching your kubeconfig file, then you can add both contexts to your kubeconfig and use kubectl config use-context to switch between them, as @Ijaz Khan suggests.
As per kubernetes docs: http://kubernetes.io/docs/user-guide/configmap/
Kubernetes has a ConfigMap API resource which holds key-value pairs of configuration data that can be consumed in pods.
This looks like a very useful feature, as many containers require configuration via some combination of config files and environment variables.
Is there a similar feature in Docker 1.12 swarm?
Sadly, Docker (even in 1.12 with swarm mode) does not support the variety of use cases that you could solve with ConfigMaps (also no Secrets).
The only things supported are external env files in both Docker (https://docs.docker.com/engine/reference/commandline/run/#/set-environment-variables-e-env-env-file) and Compose (https://docs.docker.com/compose/compose-file/#/env-file).
These are good to keep configuration out of the image, but they rely on environment variables, so you cannot just externalize your whole config file (e.g. for use in nginx or Prometheus). Also you cannot update the env file separately from the deployment/service, which is possible with K8s.
Workaround: You could build your configuration files so that they are generated from the variables in the env file.
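A minimal sketch of that workaround, using only the Python standard library, could look like this. The template content and the variable names LISTEN_PORT and SERVER_NAME are invented for the example. The idea is to run it as an entrypoint step that writes the real config file before the service starts.

```python
import os
from string import Template

# Invented example template. A literal "$" in the real config (for instance an
# nginx variable) has to be written as "$$" so Template leaves it alone.
NGINX_TEMPLATE = Template(
    "server {\n"
    "    listen ${LISTEN_PORT};\n"
    "    server_name ${SERVER_NAME};\n"
    "}\n"
)


def render(template, env=None):
    """Fill the template from environment variables; a missing variable raises KeyError."""
    env = os.environ if env is None else env
    return template.substitute(env)
```

Failing loudly on a missing variable is a deliberate choice here, since a half-rendered config file is usually harder to debug than a container that refuses to start.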
I'd guess sooner or later Docker will add that functionality. Currently, Swarm is still in its early days, so for advanced use cases you'd need to either wait (mid to long term all platforms will have similar features), build your own hack/workaround, or go with K8s, which has that stuff integrated.
Sidenote: For Secrets storage I would recommend Hashicorp's Vault. However, for configuration it might not be the right tool.