Google Cloud Composer vCPU time Confusion

I've been trying Composer recently to run my pipeline, and found it surprisingly more expensive than I expected. Here is what I got from the bill:
Cloud Composer Cloud Composer vCPU time in South Carolina: 148.749 hours
[Currency conversion: USD to AUD using rate 1.475] A$17.11
Cloud Composer Cloud Composer SQL vCPU time in South Carolina: 148.749 hours
[Currency conversion: USD to AUD using rate 1.475] A$27.43
I only used Composer for two or three days, and definitely wasn't running it 24 hours per day, so I don't know where the 148 hours came from.
Does that mean that after you deploy the DAG to Composer, even when it's not running, it's still using resources and Composer keeps accumulating vCPU time?
How can I reduce the cost if I want to use Composer to run my pipeline every day? Thanks.

Cloud Composer primarily charges for the compute resources allocated to an environment, because most of its components continue to run even when no DAGs are deployed. Airflow is first and foremost a workflow scheduler, so there's not much you can turn off and still expect to be there the moment a workflow is ready to run.
In your case, the billed vCPU time comes from your environment's GKE nodes and the managed Airflow (Cloud SQL) database. Aside from the GKE node count, there's not much you can reduce or turn off, so if you need anything smaller, you may want to consider self-managed Airflow or another platform entirely. The same applies if your primary objective is solely processing data and you don't need the scheduling that Airflow offers.

As far as I am aware, autoscaling is not a built-in feature of Composer yet.
At the worker level, you should be able to achieve this by manually modifying the configuration of the Composer environment so that its Kubernetes workers scale up and down according to the workload.
Joshua Hendinata wrote a guide on the steps necessary for enabling autoscaling in Composer [1].
This article, which introduces ways to save on Composer costs, may also be of interest [2].
Hope this helps you out!
[1] https://medium.com/traveloka-engineering/enabling-autoscaling-in-google-cloud-composer-ac84d3ddd60
[2] https://medium.com/condenastengineering/automating-a-cloud-composer-development-environment-590cb0f4d880

Related

Job-based cloud processing solution

I would like to do some cloud processing on a very small cluster of machines (<5).
This processing should be based on 'jobs', where jobs are parameterized scripts that run in a certain docker environment.
As an example for what a job could be:
Run in docker image "my_machine_learning_docker"
Download some machine learning dataset from an internal server
Train some neural network on the dataset
Produce a result and upload it to a server again.
My use cases are not limited to machine learning however.
A job could also be:
Run in docker image "my_image_processing_docker"
Download a certain amount of images from some folder on a machine.
Run some image optimization algorithm on each of the images.
Upload the processed images to another server.
Now what I am looking for is some framework/tool that keeps track of the compute servers, receives my jobs, and dispatches them to an available server. Advanced prioritization, load management, or anything like that is not really required.
It should be possible to query the status of jobs and of the servers via an API (I want to do this from NodeJS).
Potentially, I could imagine this framework/tool dynamically spinning up these compute servers in AWS, Azure, or something similar. That would not be a hard requirement though.
I would also like to host this solution myself. So I am not looking for a commercial solution for this.
Now I have done some research, and what I am trying to do has similarities with many, many existing projects, but I have not "quite" found what I am looking for.
Similar things I have found were (selection):
CI/CD solutions such as Jenkins/Gitlab CI. Very similar, but it seems to be tailored very much towards the CI/CD case, and I am not sure whether it is such a good idea to abuse a CI/CD solution for what I am trying to do.
Kubernetes: Appears to be able to do this somehow, but is said to be very complex. It also looks like overkill for what I am trying to do.
Nomad: Appears to be the best fit so far, but it has some proprietary vibes that I am not very much a fan of. Also it still feels a bit complex...
In general, there are many many different projects and frameworks, and it is difficult to find out what the simplest solution is for what I am trying to do.
Can anyone suggest anything or point me in a direction?
Thank you
I would use Jenkins for this use case even if it appears to you as a "simple" one. You can start with the simplest pipeline, which can also deal with the increasing complexity of your jobs. Jenkins has an API, lots of plugins, and it can be run as a container for spinning up in a cloud environment.
It's possible you're looking for something like AWS Batch: https://aws.amazon.com/batch/ or Google Dataflow: https://cloud.google.com/dataflow. Out of the box they do scaling, distribution, monitoring, etc.
But if you want to roll your own ....
Option 1: Queues
For your job distribution you are really just looking for a simple message queue that all of the workers listen on. In most messaging platforms, a queue supports deliver-once semantics. For example:
Active MQ: https://activemq.apache.org/how-does-a-queue-compare-to-a-topic
NATS: https://docs.nats.io/using-nats/developer/receiving/queues
Using queues for load distribution is a common pattern.
A queue-based solution works with both manual and automated load balancing: the more workers you spin up, the more consumers you have pulling off the queue. The same messaging solution can be used to gather the results if you need to, using message reply semantics or a dedicated reply channel. You could use the result channel to post progress reports back, so your main application would know the status of each worker. Alternatively, workers could drop their status in a database. It probably depends on your preference for collecting results and how large the result sets are. If they are large enough, you might even just drop results in an S3 bucket or some kind of filesystem.
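To make the pattern concrete, here is a minimal sketch of the producer and worker sides of such a queue. It uses Redis purely as a stand-in for whichever broker you pick (ActiveMQ, NATS, etc.); the queue names, payload fields, and the run_in_docker helper are illustrative assumptions, not anything prescribed above.

import json
import redis  # pip install redis; stand-in for ActiveMQ/NATS in this sketch

r = redis.Redis(host="localhost", port=6379)

def submit_job(job_params: dict) -> None:
    # Producer side: push one job description onto the shared queue.
    r.rpush("jobs", json.dumps(job_params))

def run_in_docker(job: dict) -> dict:
    # Placeholder for launching the job's docker image and collecting status.
    return {"job": job, "status": "success"}

def worker_loop() -> None:
    # Worker side: every worker blocks on the same queue, so adding more
    # workers automatically spreads the load (each message is delivered once).
    while True:
        _, raw = r.blpop("jobs")
        job = json.loads(raw)
        result = run_in_docker(job)
        r.rpush("results", json.dumps(result))  # dedicated reply channel

The same "results" channel could carry progress reports as well as final results.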
You could use something quite simple to manage the workers. Jenkins was already suggested and is definitely a solution I have seen used for running multiple instances across many servers, as you just need to install the Jenkins agent on each of the workers. This can work quite easily if you own or manage the physical servers it's running on. You could use TeamCity as well.
If you want something cloud hosted, it may depend on the technology you use. Kubernetes is probably overkill here, but it could certainly be used to spin up N nodes and increase/decrease the number of workers. To auto-scale you could publish a single metric, the queue depth, and trigger an increase in the number of workers based on how deep the queue is and a threshold you work out from the cost of spinning up new nodes vs. the rate at which jobs are processed.
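As a rough illustration of that scaling rule (the queue name, per-worker throughput, and worker cap below are made-up numbers for the sketch, not recommendations):

import math
import redis

r = redis.Redis()

def desired_worker_count(jobs_per_worker: int = 20, max_workers: int = 10) -> int:
    # Derive the worker pool size from a single metric: the current queue depth.
    depth = r.llen("jobs")
    return min(max_workers, max(1, math.ceil(depth / jobs_per_worker)))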
You could also look at some of the lightweight managed container solutions like fly.io or Heroku, which are both much easier to set up than K8s and would let you scale up easily.
Option 2: Web workers
Can you design your solution so that it can be run as a cloud function/web worker?
If so, you could set them up so that scaling is fully automated. You would hit the cloud function endpoint to request each job. The hosting engine would take care of the distribution and scaling of the workers. The results would be passed back in the body of the HTTP response as a JSON blob.
Your workload may be too large for these solutions, but if it's actually fairly lightweight and quick, it could be a simple option.
I don't think these solutions would let you query the status of tasks easily.
If this option seems appealing, there are quite a few choices:
https://workers.cloudflare.com/
https://cloud.google.com/functions
https://aws.amazon.com/lambda/
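As a rough sketch, assuming Google Cloud Functions with the Python runtime (the function name and payload fields are hypothetical), the job endpoint could look something like this:

import json

def handle_job(request):
    # HTTP entry point: receives one job as JSON, returns the result as a JSON blob.
    job = request.get_json(silent=True) or {}

    # Placeholder for the real work, e.g. fetching inputs and processing them.
    result = {"job_id": job.get("job_id"), "status": "success"}

    return (json.dumps(result), 200, {"Content-Type": "application/json"})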
Option 3: Google Cloud Tasks
This is a bit of a hybrid option. Essentially GCP has a queue-distribution workflow where the endpoint is a cloud function or some other supported worker, including Cloud Run, which uses docker images. I've not actually used it myself, but maybe it fits the bill.
https://cloud.google.com/tasks
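For a feel of the API shape only (I have not used it in anger either), enqueueing a job with the Cloud Tasks Python client might look roughly like the sketch below; the project, location, queue name, and worker URL are all placeholders:

import json
from google.cloud import tasks_v2  # pip install google-cloud-tasks

client = tasks_v2.CloudTasksClient()
parent = client.queue_path("my-project", "us-central1", "job-queue")

task = {
    "http_request": {
        "http_method": tasks_v2.HttpMethod.POST,
        "url": "https://my-worker-abc123.a.run.app/process",  # Cloud Run / Cloud Functions worker
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"job_id": "123"}).encode(),
    }
}

response = client.create_task(request={"parent": parent, "task": task})
print("Enqueued:", response.name)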
When I look at a problem like this, I think through the entirety of the data paths: the mapping between source image and target image, and any metadata or status information that needs to be collected. Additionally, failure conditions need to be handled, especially if a production service is going to be built.
I prefer running Python and PySpark with Pandas UDFs to perform the orchestration and image processing.
S3FS lets me access S3. If using Azure or Google, Databricks' DBFS lets me seamlessly read and write cloud storage without two extra file-copy steps.
PySpark's binaryFile data source lets me list all of the input files to be processed. Spark lets me run this in a batch or incremental/streaming configuration. This design optimizes for end-to-end data flow and data reliability.
For a cluster manager I use Databricks, which lets me easily provision an auto-scaling cluster. The Databricks cluster manager lets users deploy docker containers or use cluster libraries or notebook scoped libraries.
The example below assumes the image is > 32MB and processes it out of band. If the image is in the KB range then dropping the content is not necessary and in-line processing can be faster (and simpler).
Pseudo code:
import pandas as pd
from typing import Iterator
from pyspark.sql.functions import col, pandas_udf

# List the input files; drop the binary content so only paths get distributed.
df = (spark.read
    .format("binaryFile")
    .option("pathGlobFilter", "*.png")
    .load("/path/to/data")
    .drop("content")
)

def do_image_xform(path: str) -> str:
    # Do image transformation: read from the dbfs path, write to a dbfs path.
    ...
    # return xform status
    return "success"

@pandas_udf("string")
def do_image_xform_udf(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
    for paths in iterator:
        yield paths.apply(do_image_xform)

df_status = df.withColumn("status", do_image_xform_udf(col("path")))
df_status.write.saveAsTable("status_table")  # triggers execution, saves status

How to use delayed_job with Rails on Google App Engine?

This may be more of an App Engine question than a delayed_job question. But generally, how can I keep a long-lived process running to handle the scheduling of notifications and the sending of scheduled notifications on Google App Engine?
The maintainers of delayed_job (https://github.com/collectiveidea/delayed_job) include a script for production deploys, but it seems to stop after a few hours. I'm trying to figure out the best approach to ensure that the script stays running, and also that it is able to access the logs for debugging purposes.
I believe that Google Pub/Sub is also a possibility, but I would ideally like to avoid setting up additional infrastructure for such a small project.
For running long processes that last for hours, App Engine will not be the ideal solution, since requests are capped at 60 seconds (GAE Standard) and 60 minutes (GAE Flex).
The best option would be a Compute Engine based solution, since then you would be able to keep the GCE VM up for long periods.
Once you have deployed a RESTful application on your GCE VM, you can use Cloud Scheduler to create a scheduled job with this command:
gcloud scheduler jobs create http JOB --schedule=SCHEDULE --uri=APP_PATH
You can find more about this solution in this article.
If App Engine is required, take into consideration the maximum request times mentioned above. Additionally, you can take a look at Cloud Tasks, since those fit pretty well into your requirement.

Using Dataflow vs. Cloud Composer

I'd like to get some clarification on whether Cloud Dataflow or Cloud Composer is the right tool for the job; it wasn't clear to me from the Google documentation.
Currently, I'm using Cloud Dataflow to read a non-standard csv file -- do some basic processing -- and load it into BigQuery.
Let me give a very basic example:
# file.csv
type\x01date
house\x0112/27/1982
car\x0111/9/1889
From this file we detect the schema and create a BigQuery table, something like this:
`table`
type (STRING)
date (DATE)
And, we also format our data to insert (in python) into BigQuery:
DATA = [
("house", "1982-12-27"),
("car", "1889-9-11")
]
This is a vast simplification of what's going on, but this is how we're currently using Cloud Dataflow.
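Roughly, the shape of such a pipeline in Apache Beam (Python) would be something like the sketch below; the bucket, project, dataset, and table names are placeholders, and this is only an illustration, not our actual code:

import apache_beam as beam

def parse_line(line):
    # Split on the \x01 delimiter and normalize the date to YYYY-MM-DD.
    record_type, raw_date = line.split("\x01")
    month, day, year = raw_date.split("/")
    return {"type": record_type, "date": f"{year}-{int(month):02d}-{int(day):02d}"}

with beam.Pipeline() as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/file.csv", skip_header_lines=1)
        | "Parse" >> beam.Map(parse_line)
        | "Write" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.my_table",
            schema="type:STRING,date:DATE",
        )
    )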
My question then is, where does Cloud Composer come into the picture? What additional features could it provide on the above? In other words, why would it be used "on top of" Cloud Dataflow?
Cloud Composer (which is backed by Apache Airflow) is designed for small-scale task scheduling.
Here is an example to help you understand:
Say you have a CSV file in GCS, and, using your example, say you use Cloud Dataflow to process it and insert the formatted data into BigQuery. If this is a one-off thing, you have just finished it and it's perfect.
Now let's say your CSV file is overwritten at 01:00 UTC every day, and you want to run the same Dataflow job to process it every time it's overwritten. If you don't want to manually run the job at exactly 01:00 UTC regardless of weekends and holidays, you need something to periodically run the job for you (in our example, at 01:00 UTC every day). Cloud Composer can help you in this case. You can provide a config to Cloud Composer, which includes what jobs to run (operators), when to run them (a job start time), and at what frequency (daily, weekly, or even yearly).
It seems cool already; however, what if the CSV file is overwritten not at 01:00 UTC but at any time during the day? How would you choose the daily running time? Cloud Composer provides sensors, which can monitor a condition (in this case, the CSV file's modification time). Cloud Composer can guarantee that it kicks off a job only if the condition is satisfied.
There are a lot more features that Cloud Composer/Apache Airflow provides, including having a DAG to run multiple jobs, failed-task retries, failure notifications, and a nice dashboard. You can also learn more from their documentation.
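To make that concrete, here is a minimal sketch of what such a Composer/Airflow DAG could look like. The DAG id, schedule, bucket, object name, and the way the Dataflow job is launched are all placeholder assumptions, and the exact operator/sensor import paths depend on your Airflow and Google provider versions:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.google.cloud.sensors.gcs import GCSObjectUpdateSensor

with DAG(
    dag_id="daily_csv_to_bq",           # hypothetical name
    schedule_interval="0 1 * * *",      # 01:00 UTC every day
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    # Sensor: wait until the CSV in GCS has been updated since the last run.
    wait_for_csv = GCSObjectUpdateSensor(
        task_id="wait_for_csv",
        bucket="my-bucket",             # placeholder
        object="file.csv",              # placeholder
    )

    # Operator: launch the existing Dataflow pipeline however you normally run it.
    run_dataflow = BashOperator(
        task_id="run_dataflow_job",
        bash_command="python /home/airflow/gcs/dags/run_pipeline.py",
    )

    wait_for_csv >> run_dataflow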
For the basics of the task you describe, Cloud Dataflow is a good choice. Big data that can be processed in parallel is a good fit for Cloud Dataflow.
The real world of processing big data is usually messy. Data is usually somewhat to very dirty, arrives constantly or in big batches, and needs to be processed in time-sensitive ways. Usually it takes the coordination of more than one task/system to extract the desired data. Think of load, transform, merge, extract, and store types of tasks. Big data processing is often glued together using shell scripts and/or Python programs. This makes automation, management, scheduling, and control processes difficult.
Google Cloud Composer is a big step up from Cloud Dataflow. Cloud Composer is a cross platform orchestration tool that supports AWS, Azure and GCP (and more) with management, scheduling and processing abilities.
Cloud Dataflow handles tasks. Cloud Composer manages entire processes coordinating tasks that may involve BigQuery, Dataflow, Dataproc, Storage, on-premises, etc.
My question then is, where does Cloud Composer come into the picture? What additional features could it provide on the above? In other words, why would it be used "on top of" Cloud Dataflow?
If you need / require more management, control, scheduling, etc. of your big data tasks, then Cloud Composer adds significant value. If you are just running a simple Cloud Dataflow task on demand once in a while, Cloud Composer might be overkill.
Cloud Composer (Apache Airflow) = task scheduling.
Cloud Dataflow (Apache Beam) = handling the tasks themselves.
For me, Cloud Composer is a step up (a big one) from Dataflow. If I had one task, let's say processing my CSV file from Storage to BigQuery, I would/could use Dataflow. But if I wanted to run the same job daily, I would use Composer.

Continuous delivery of a docker container to Google Cloud

Goal:
I have an application based inside a docker container. I want to be able to use continuous integration for it, with push-to-deploy using Bitbucket Pipelines to Google Cloud. I need access to an SQL database (MariaDB preferably) and some kind of caching system (be it memcache, redis, or something else).
Problem:
I'm entirely unsure of what services I need from said cloud provider to facilitate this as simply as possible while still being cost effective. I looked into using Google's App Engine, but, unless I'm doing something odd, 1 vCPU with 1 GB of RAM and 10 GB of storage came to US$55/month, which is far more than I really want to pay. And I'm not even sure this is what I need. I don't need much power (these are very small apps, used by a very small number of people). Also, again, maybe I'm doing something wrong, but I was unable to find a caching solution with Google that wasn't insanely expensive (Memorystore). Basically, I'm completely overwhelmed by the number of options and am just looking for a cost-effective, simple solution for continuous delivery of a docker application to Google Cloud.
Using the tools discussed here, you can set up an end-to-end continuous delivery pipeline covering the code, build, test, deploy, and monitor phases of software development across multi-cloud, hybrid, or on-premises environments. This is likely a good way to start.

Run a large amount of tasks on a cluster

I'm looking for a solution to running a large amount of tasks and monitoring their status on a cluster.
In detail: each task consists of 3-4 processes, which are Docker-contained (each process is a docker run command). All of the processes have to run on the same server.
The amount of tasks we're talking about is bursts of several hundreds of tasks at a time.
I've looked into several solutions all of them based on Mesos:
Chronos - Seems like it would falter under high load, and in any case it is directed more towards recurring (cron) jobs, while I need one-time (heavy) jobs.
Custom Mesos FW - Seems too low-level for my needs; it would require me to write scheduling and retrying mechanisms, so I'd save this for a last resort.
Aurora - This seems promising, as each task is run on the same node and is comprised of several processes. I am missing a couple of things here, though: Aurora seems unable to run several tasks as part of a single job. Since my tasks are all similar with different input, I could use a single job with many (say 400) instances, and the first process of each task (whose role is to download the input from S3) could download a different set based on the instance ID. Which brings me to another problem: I can't find a working example of using {{ mesos.instance }} in .aurora files. Can anyone give me an example?
Thanks for all the fish people
You could also have a look at Kubernetes (which can also be run as a framework on Mesos). Kubernetes has the concept of Pods, which are basically a set of co-located containers. So in your case a pod would consist of your 3-4 processes/containers, and then these pods can be scaled up/down.
Short comments regarding the other solutions you mentioned:
Chronos: Not really targeting your use case
Custom FW: Actually not so difficult, but good call to save this as last resort.
Aurora: Very powerful but also complex framework
Marathon (which you didn't mention): targeted for long running applications which can be easily scaled up and down.
In addition to the other excellent answer, you could check out Two Sigma's Cook, which they have only recently open-sourced but have been using in production at scale for a while.
