cloud composer spark submit existing Hadoop cluster - google-cloud-composer

I've been trying out Cloud Composer for a few days.
I need to spark-submit jobs to our existing Hadoop cluster in YARN mode.
Is this possible with Cloud Composer?

Anything that can be done with Apache Airflow 1.9.0 can currently be done with Cloud Composer. In this case, maybe take a look at the SparkSubmitOperator?
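For illustration, a minimal sketch of a DAG using that operator might look like the following. The connection id spark_yarn, the application path and the main class are placeholders, not anything from the question; the Airflow connection's host would point at YARN, and the workers also need your cluster's Hadoop/YARN client configuration and network access for this to work.

    # Minimal sketch, not a drop-in solution: assumes an Airflow connection
    # named "spark_yarn" whose host is "yarn", plus Hadoop/YARN client configs
    # available to the worker that runs the task.
    from datetime import datetime

    from airflow import DAG
    from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

    with DAG(dag_id="spark_submit_to_yarn",
             start_date=datetime(2018, 1, 1),
             schedule_interval=None) as dag:

        submit_job = SparkSubmitOperator(
            task_id="submit_job",
            conn_id="spark_yarn",                       # hypothetical connection id
            application="/path/to/your-spark-job.jar",  # placeholder path
            java_class="com.example.YourMainClass",     # placeholder main class
            num_executors=4,
            executor_memory="2g",
            application_args=["--run-date", "{{ ds }}"],
        )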

Related

Airflow on Google Cloud Composer vs Docker

I can't find much information on the differences between running Airflow on Google Cloud Composer versus in Docker. I am trying to move our data pipelines, which currently run on Google Cloud Composer, onto Docker so they run locally, but I'm trying to conceptualize what the difference is.
Cloud Composer is a GCP managed service for Airflow. Composer runs in something known as a Composer environment, which runs on a Google Kubernetes Engine cluster. It also makes use of various other GCP services, such as:
Cloud SQL - stores the metadata associated with Airflow,
App Engine Flex - the Airflow web server runs as an App Engine Flex application, which is protected using an Identity-Aware Proxy,
GCS bucket - in order to submit a pipeline to be scheduled and run on Composer, all we need to do is copy our Python code into a GCS bucket. Within that bucket there is a folder called dags; any Python code uploaded into that folder is automatically picked up and processed by Composer.
How does Cloud Composer benefit you?
Focus on your workflows, and let Composer manage the infrastructure (creating the workers, setting up the web server, the message brokers),
One-click creation of a new Airflow environment,
Easy and controlled access to the Airflow Web UI,
Provides logging and monitoring metrics, and alerts when your workflow is not running,
Integrates with all Google Cloud services: Big Data, Machine Learning and so on. You can also run jobs elsewhere, i.e. on another cloud provider (Amazon).
Of course you have to pay for the hosting service, but the cost is low compared to hosting a production Airflow server on your own.
Airflow on-premise
DevOps work that needs to be done: create a new server, manage the Airflow installation, take care of dependency and package management, check server health, and handle scaling and security.
pulling an Airflow image from a registry and creating the container,
creating a volume that maps the directory on the local machine where DAGs are held to the location where Airflow reads them in the container,
whenever you want to submit a DAG that needs to access a GCP service, you need to take care of setting up credentials. A service account should be created and its key downloaded as a JSON file that contains the credentials. This JSON file must be mounted into your Docker container, and the GOOGLE_APPLICATION_CREDENTIALS environment variable must contain the path to the JSON file inside the container.
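As a quick sanity check that the mounted key is actually visible inside the container, something like the following sketch could be run. It assumes the google-cloud-storage client library is installed in the image; the check itself is not part of the original answer.

    # Sketch of a credentials sanity check inside the container.
    # Assumes GOOGLE_APPLICATION_CREDENTIALS points at the mounted JSON key
    # and that google-cloud-storage is installed.
    import os

    from google.cloud import storage

    key_path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not key_path or not os.path.isfile(key_path):
        raise RuntimeError("Service account key not found; check the volume mount.")

    # Application Default Credentials pick up the env var automatically.
    client = storage.Client()
    print([bucket.name for bucket in client.list_buckets()])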
To sum up, if you don't want to deal with all of those DevOps problems, and instead just want to focus on your workflows, then Google Cloud Composer is a great solution for you.
Additionally, I would like to share with you tutorials that set up Airflow with Docker and on GCP Cloud Composer.

OSSEC_HIDS Kubernetes Deployment

Which would be the best HIDS (host-based intrusion detection system) to deploy on Kubernetes on Google Cloud Platform?
I want to build the Docker image on debian:stable-slim, so I have been testing ossec-docker and wazuh-docker.
Here are the repos, respectively:
OSSEC: https://github.com/Atomicorp/ossec-docker
WAZUH: https://github.com/wazuh/wazuh-docker
The wazuh-api=3.7.2-1 package is broken: it needs nodejs >= 4.6.0, but even with nodejs 6.10.0 or higher I am unable to get it to install on debian:stable-slim.
Can anyone suggest a host-based intrusion detection system that I can configure and deploy on Docker/Kubernetes? If you have a GitHub repo link, I would really appreciate it.
Wazuh has a repository for Kubernetes. Right now, it is focused on AWS, but I think you just need to change the volumes configuration (it is implemented for AWS EBS) and it should work in GCP.

Connect external workers to Cloud Composer airflow

Is it possible to connect an external worker that is not part of the Cloud Composer Kubernetes cluster? Use case would be connecting a box in a non-cloud data center to a Composer cluster.
Hybrid clusters are not currently supported in Cloud Composer. If you attempt to roll your own solution on top of Composer, I'd be very interested in hearing what did or didn't work for you.

Does Cloud Composer have failover?

I've read the Cloud Composer overview (https://cloud.google.com/composer/) and documentation (https://cloud.google.com/composer/docs/).
It doesn't seem to mention failover.
I'm guessing it does, since it runs on a Kubernetes cluster. Does it?
By failover I mean: if the Airflow webserver or scheduler stops for some reason, does it get started again automatically?
Yes, since Cloud Composer is built on Google Kubernetes Engine, it benefits from all the fault tolerance of any other service running on Kubernetes Engine. Pod and machine failures are automatically healed.

How to update DAGs in Google Cloud Composer

I want to automate the deployment of DAGs written in a certain repository.
To achieve that I use the gcloud tool and this just imports the DAGs and Plugins according to the documentation.
Now the problem is that when I change the structure of a DAG, I just cannot get it to load/run correctly in the web interface. When I use Airflow locally I simply restart the webserver and everything is fine, but with Cloud Composer I cannot find out how to restart the webserver.
We only support uploading DAGs through GCS currently: https://cloud.google.com/composer/docs/how-to/using/managing-dags
The webserver, which is hosted through GAE, can't be restarted.
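If the goal is to automate deployment from a repository, one option (besides gcloud composer environments storage dags import) is to copy the files into the environment's dags/ folder yourself. A rough sketch with the google-cloud-storage client, using a placeholder bucket name rather than anything from the question, could look like this:

    # Rough sketch: upload every DAG file from a local checkout into the
    # environment's GCS dags/ folder. The bucket name below is a placeholder;
    # the real one is shown in your Composer environment's details.
    import glob
    import os

    from google.cloud import storage

    DAGS_BUCKET = "us-central1-my-environment-1234abcd-bucket"  # placeholder

    client = storage.Client()
    bucket = client.bucket(DAGS_BUCKET)

    for path in glob.glob("dags/*.py"):
        blob = bucket.blob("dags/" + os.path.basename(path))
        blob.upload_from_filename(path)
        print("Uploaded", path)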
