I have an issue monitoring some short-lived tasks with Prometheus and SCDF. We are using Micrometer and sending the metrics to the RSocket proxy, which is scraped by Prometheus, as shown in the SCDF documentation.
The metrics do show up in Prometheus, but because of the way Prometheus works, and because the RSocket proxy removes the metrics once they are scraped, we can't really graph or use them, since this task runs at 12- or 24-hour intervals.
Has anyone run into this and found a way to properly monitor short-lived tasks that are triggered quite far apart, without resorting to the Pushgateway?
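For context, the instrumentation itself is nothing special; a minimal sketch of the kind of Micrometer code involved (the class and metric names here are made up for illustration, not our actual code):

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

// Hypothetical task component; metric names are illustrative only.
public class FileImportTask {

    private final MeterRegistry registry;

    public FileImportTask(MeterRegistry registry) {
        this.registry = registry;
    }

    public void run() {
        Timer.Sample sample = Timer.start(registry);
        try {
            // ... the actual short-lived work happens here ...
            registry.counter("file_import.records.processed").increment(42);
        } finally {
            // Record how long the task took; this is what we'd like to graph,
            // even though the task only runs every 12-24 hours.
            sample.stop(Timer.builder("file_import.duration").register(registry));
        }
    }
}
```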
Related
I'd like to use Airflow with StatsD and DataDog to monitor whether a DAG takes, for example, twice as long as its previous execution(s). So I need some kind of real-time timer for a DAG (or operator).
I'm aware that Airflow supports some metrics.
However, to my understanding all of these metrics relate to finished tasks/DAGs, right? So they're not the solution, because I'd like to monitor running DAGs.
I also considered the execution_timeout/SLA features, but they are not suitable for this use case: I'd like to be notified when a DAG hangs, but I don't want to kill it.
There are a number of different ways you could handle this:
In the past I've configured a telemetry DAG which would collect the current state of all tasks/DAGs by querying the metadata tables. I'd collect these metrics and push them up to CloudWatch. This became problematic as these internal fields change often so we would run into issues when trying to upgrade to newer versions of Airflow.
There are also some well-maintained Prometheus exporters that some companies have open sourced. By setting these up you could poll the exposed export path as frequently as you wanted to (DataDog supports Prometheus).
These are just some of your options. Since the Airflow webserver is just a Flask app you can really expose metrics in whatever way you see fit.
As I understand it, you can monitor running tasks in DAGs using DataDog; refer to its Airflow integration docs.
You can look up the metrics in the DogStatsD docs. Looking at this page would also be useful to understand what to monitor.
E.g., metrics such as the following:
airflow.operator_failures: monitors failed operators.
airflow.operator_successes: monitors successful operators.
airflow.dag_processing.processes: number of currently running DAG parsing processes.
airflow.scheduler.tasks.running: number of tasks running in executors (shown as tasks).
We have been using Spring Batch for the use cases below:
Read data from a file, process it, and write it to a target database (the batch kicks off when the file arrives).
Read data from a remote database, process it, and write it to a target database (runs on a scheduled interval, triggered by Autosys).
With the plan to move all online apps to Spring Boot microservices on PCF, we are looking at doing a similar exercise on the batch side if it adds value.
In the new world, the Spring Cloud Task batch job would read the file from S3 storage (ECS S3).
I am looking for a good design here (staying away from too many pipes/filters and orchestration if possible); the input data ranges from 1MM to 20MM records.
ECS S3 will notify on file arrival by sending an HTTP request, so the workflow would be: cloud stream HTTP source -> launch a cloud batch job task that reads from the object store, processes the records, and saves them to the target database.
A Spring Cloud Task job triggered from the PCF Scheduler would read from the remote database, process, and save to the target database.
With the above design, I don't see the value of wrapping the Spring Batch job into a cloud task and running it on PCF with Spring Cloud Data Flow.
Am I missing something here? Is PCF/Spring Cloud Data Flow overkill in this case?
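For reference, the job itself would be a fairly standard chunk-oriented Spring Batch job, something along these lines (a simplified sketch; the record type, SQL, and bean wiring are illustrative only):

```java
import javax.sql.DataSource;

import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.database.builder.JdbcBatchItemWriterBuilder;
import org.springframework.batch.item.file.builder.FlatFileItemReaderBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.Resource;

@Configuration
public class ImportJobConfig {

    // Illustrative record type; the real file has more fields.
    public static class Record {
        private String id;
        private String payload;
        public String getId() { return id; }
        public void setId(String id) { this.id = id; }
        public String getPayload() { return payload; }
        public void setPayload(String payload) { this.payload = payload; }
    }

    // inputFile and targetDb would be wired up elsewhere (e.g. the object
    // pulled from ECS S3 and the target database's DataSource).
    @Bean
    public Step importStep(StepBuilderFactory steps, Resource inputFile, DataSource targetDb) {
        return steps.get("importStep")
                .<Record, Record>chunk(1000)
                .reader(new FlatFileItemReaderBuilder<Record>()
                        .name("fileReader")
                        .resource(inputFile)
                        .delimited()
                        .names("id", "payload")
                        .targetType(Record.class)
                        .build())
                .writer(new JdbcBatchItemWriterBuilder<Record>()
                        .dataSource(targetDb)
                        .sql("INSERT INTO target_table (id, payload) VALUES (:id, :payload)")
                        .beanMapped()
                        .build())
                .build();
    }

    @Bean
    public Job importJob(JobBuilderFactory jobs, Step importStep) {
        return jobs.get("importJob").start(importStep).build();
    }
}
```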
Orchestrating batch jobs in a cloud setting can bring new benefits to the solution. For instance, the resiliency model that PCF supports could be useful. A Spring Cloud Task (SCT) typically runs in a short-lived container; if it goes down, PCF will bring it back up and run it.
Both of the options listed above are feasible, and it comes down to the use case with respect to the frequency at which you're processing the incoming data. Whether it is really real-time, or can happily run on a schedule, is something you'd have to determine in order to make the decision.
As for the applicability of Spring Cloud Data Flow (SCDF) + PCF, again, it comes down to your business requirements. You may not be using it now, but Spring Batch Admin is EOL in favor of SCDF's Dashboard. The following questions might help realize the SCDF + SCT value proposition.
Do you have to monitor the overall batch jobs' status, progress, and health? Maybe you have requirements to assemble multiple batch jobs as a DAG? How about visually composing a series of tasks and orchestrating them entirely from the Dashboard?
Also, when the batch jobs are used together with SCT, SCDF, and the PCF Scheduler, you get the added benefit of monitoring all of this from the PCF Apps Manager.
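To be concrete about what the "wrapping" amounts to: turning a Spring Batch job into a Task is mostly a matter of adding Spring Cloud Task's @EnableTask annotation to the Boot application, roughly like this (a minimal sketch, not a complete configuration):

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.task.configuration.EnableTask;
import org.springframework.context.annotation.Bean;

// @EnableTask records the run (start, end, exit code) in the Task repository,
// which is what SCDF and PCF use to track it; the batch job itself is unchanged.
@EnableTask
@EnableBatchProcessing
@SpringBootApplication
public class BatchTaskApplication {

    @Bean
    public Job job(JobBuilderFactory jobs, StepBuilderFactory steps) {
        Step step = steps.get("step")
                .tasklet((contribution, chunkContext) -> {
                    // ... read/process/write work goes here ...
                    return RepeatStatus.FINISHED;
                })
                .build();
        return jobs.get("job").start(step).build();
    }

    public static void main(String[] args) {
        SpringApplication.run(BatchTaskApplication.class, args);
    }
}
```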
Normally we run the Spring Cloud Data Flow server jar on one machine, but what if, over time, we create many flows and that machine gets overloaded and becomes a single point of failure? Is there a way to run the Spring Cloud Data Flow server jar on another machine and shift the flows onto it, so that we can avoid such failures and make our complete system more resilient and robust? Or does the expansion happen automatically when we deploy our complete system on PCF/Cloud Foundry?
SCDF is a simple Boot application. It doesn't retain any state about the stream/task applications itself, but it does keep track of the DSL definitions in the database.
It is common to provision multiple instances of SCDF-server and a load balancer in front for resiliency.
In PCF specifically, if you scale the SCDF server to more than one instance, PCF will automatically load-balance the incoming traffic (from the SCDF Shell/GUI). It is also important to note that PCF will automatically restart an application instance if it goes down for any reason. You will be set up for multiple levels of resiliency this way.
I have a RoR application with the Prometheus client installed and a Telegraf daemon with the Prometheus input plugin running on the instance I want to monitor.
As far as I understand, I need some kind of exporter middleware to collect metrics from Prometheus::Client.registry and expose them at a /metrics HTTP endpoint.
What I don't really understand is how to pass all metrics from different environments (e.g. from a rake task and the app's runtime code) into the same registry (it's an instance variable of Prometheus::Middleware::Exporter.new(registry)) of the same instance of the Prometheus::Middleware::Exporter middleware.
Also, will the urls = ["http://localhost:3000/metrics"] config of the Prometheus input plugin for Telegraf work on an EC2 instance, for example?
Thank you for any advice.
First of all, it's not recommended to use a catch-all exporter like Telegraf. You can read about some of the arguments in this blog post: https://www.robustperception.io/one-agent-to-rule-them-all/.
Then, if I understand your question correctly, it's not possible to use the same registry from multiple processes (like your Rails app and some rake task). Your Rails app will export its own metrics and you'll need to use a different approach for rake tasks.
As rake tasks are (usually) short-lived processes, they are not well suited to be pulled from. You have two options here, either you use the Pushgateway and the PGW-support in client_ruby to push all relevant metrics at the end of the rake task execution (like how long it took, how many items were processed, if there was any error, etc.). Alternatively, you can use the textfile collector in the node_exporter and write your metrics to disk at the end of your rake task execution. The node_exporter will then read that file and export the metrics when it gets scraped.
I don't actively monitor Stack Overflow; you'll get more help with these questions on the prometheus-users mailing list, see https://prometheus.io/community/.
Perhaps an easier way to go would be to set up a Telegraf agent on the same host (with a Prometheus output and a statsd input) and then fire events from your application into Telegraf's input in statsd format. Telegraf would then turn around and expose these metrics in Prometheus's format.
This way you'll get both Telegraf's host-level metrics (free memory, disk usage, etc.) AND your application's metrics, all exported on the same port. It doesn't require any Ruby-specific code, just the ability to fire UDP messages from your app at a local port.
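The statsd wire format is trivial to emit from any language. Purely as an illustration of the format (Java here, but the equivalent UDP send from Ruby is a one-liner; the metric name is made up), assuming Telegraf's statsd input is listening on its default port 8125:

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

public class StatsdExample {
    public static void main(String[] args) throws Exception {
        // statsd format is "<metric.name>:<value>|<type>" -- here a counter increment of 1
        byte[] payload = "myapp.orders.created:1|c".getBytes(StandardCharsets.UTF_8);
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.send(new DatagramPacket(payload, payload.length,
                    InetAddress.getByName("127.0.0.1"), 8125));
        }
    }
}
```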
We are currently using Spring Batch remote chunking to scale our batch processing. We are thinking of using Spring Cloud Data Flow, but would like to know whether workers can be dynamically provisioned based on load.
We are deployed on Google Cloud, and hence also want to consider Spring Cloud Data Flow's support for Kubernetes, if Spring Cloud Data Flow fits our needs.
When using the batch extensions of Spring Cloud Task (specifically the DeployerPartitionHandler), workers are dynamically launched as needed. That PartitionHandler allows you to configure a maximum number of workers; it then processes each partition in an independent worker, up to that max (processing the remaining partitions as others finish up). The "dynamic" aspect is really controlled by the number of partitions returned by the Partitioner: the more partitions returned, the more workers launched.
You can see a simple example configured to use CloudFoundry in this repo: https://github.com/mminella/S3JDBC. The main difference between it and what you'd need is that you'd swap out the CloudFoundryTaskLauncher for a KubernetesTaskLauncher and its appropriate configuration.
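A rough sketch of the shape of that configuration (constructor arguments and setter names vary a bit across Spring Cloud Task versions, so treat this as an outline rather than copy-paste code):

```java
import java.util.HashMap;
import java.util.Map;

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.explore.JobExplorer;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.cloud.deployer.spi.task.TaskLauncher;
import org.springframework.cloud.task.batch.partition.DeployerPartitionHandler;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.Resource;

@Configuration
public class PartitionedJobConfig {

    // The number of partitions returned here is what drives how many
    // workers get launched (capped by maxWorkers below).
    @Bean
    public Partitioner partitioner() {
        return gridSize -> {
            Map<String, ExecutionContext> partitions = new HashMap<>();
            for (int i = 0; i < gridSize; i++) {
                ExecutionContext context = new ExecutionContext();
                context.putInt("partitionNumber", i);
                partitions.put("partition" + i, context);
            }
            return partitions;
        };
    }

    // taskLauncher would be a CloudFoundryTaskLauncher or KubernetesTaskLauncher,
    // and workerResource points at the worker app's artifact (jar or docker image).
    @Bean
    public DeployerPartitionHandler partitionHandler(TaskLauncher taskLauncher,
                                                     JobExplorer jobExplorer,
                                                     Resource workerResource) {
        DeployerPartitionHandler handler =
                new DeployerPartitionHandler(taskLauncher, jobExplorer, workerResource, "workerStep");
        handler.setMaxWorkers(10); // at most 10 workers running at once
        return handler;
    }

    @Bean
    public Step managerStep(StepBuilderFactory steps,
                            Partitioner partitioner,
                            DeployerPartitionHandler partitionHandler) {
        return steps.get("managerStep")
                .partitioner("workerStep", partitioner)
                .partitionHandler(partitionHandler)
                .build();
    }
}
```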