I've got a Dataflow job that pulls messages off several Google Pub/Sub topics, does some parallel processing on the individual elements contained within those messages, then passes the collection on for further consumption by various resources. I'd like to put together a Stackdriver dashboard showing how many individual elements have been processed for each topic. Each ParDo step outputs a PCollection.
I've set up a dashboard using ElementCount, but I'm only able to filter by job, not by step. If I mouse over the line in the chart produced using ElementCount, I can see counts for every single step. Indeed, the metrics for these do appear to be reported, as I can use the gcloud command-line utility in the following manner:
gcloud beta dataflow metrics list [jobid] --filter ElementCount
...
name:
  context:
    original_name: extract_value_topic_1/Map-out0-ElementCount
    output_user_name: extract_value_topic_1/Map-out0
  name: ElementCount
  origin: dataflow/v1b3
scalar: 7000
updateTime: '2017-05-03T18:13:22.804Z'
---
name:
  context:
    original_name: extract_value_topic_2/Map-out0-ElementCount
    output_user_name: extract_value_topic_2/Map-out0
  name: ElementCount
  origin: dataflow/v1b3
scalar: 12000
updateTime: '2017-05-03T18:13:22.804Z'
I have several of these, but I don't see a straightforward way of building Stackdriver charts based on them (aside from logging to the console for every element processed and then generating a log-based metric from that, which seems like it would be extremely inefficient on a number of levels). Am I missing something? How would one create a chart based on these ElementCounts?
Edit: Additionally, if I open up the Metrics Explorer, I can enter dataflow/job/element_count into the search box and pcollection into the filter box, but I'm unable to build a dashboard with this chart in it, as the filter selection in the dashboard chart builder does not allow filtering by pcollection.
Unfortunately, you currently cannot build a dashboard with a filter on a metric label. As you noticed, the new (Beta) Metric Explorer provides the filtering functionality and the Stackdriver team is actively working on providing that functionality to the dashboard charts as well.
I will follow up if I receive any further updates or details from the Stackdriver team.
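In the meantime, the metric and its pcollection label can still be read outside the dashboard via the Monitoring API's projects.timeSeries.list method, with a filter along these lines (the label value shown is hypothetical; substitute the output_user_name from the gcloud output above):

```
metric.type = "dataflow.googleapis.com/job/element_count" AND
metric.labels.pcollection = "extract_value_topic_1/Map-out0"
```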
--Andrea
Related
I have Prometheus set up on AWS EC2. I have 11 targets configured, each of which has 2+ endpoints. I would like to set up an endpoint/query etc. to gather all the metrics in one page. I am pretty stuck right now. I could use some help, thanks :)
my prometheus targets file
Prometheus adds a unique instance label to each scraped target according to these docs.
Prometheus provides the ability to select time series matching a given series selector. For example, the series selector {instance="1.2.3.4:56"} selects all the time series containing that label, i.e. all the time series obtained from the target with the given instance label.
Prometheus provides the /api/v1/series endpoint, which returns time series matching the provided match[] series selector.
So, if you need to obtain all the time series from a particular target my-target, you can issue the following request to /api/v1/series:
curl 'http://prometheus:9090/api/v1/series?match[]={instance="my-target"}'
If you need to obtain metrics from my-target at a given timestamp, then issue a query with the series selector to /api/v1/query:
curl 'http://prometheus:9090/api/v1/query?query={instance="my-target"}&time=needed-timestamp'
If you need to obtain all the raw samples from my-target on the given time range (end_timestamp-d ... end_timestamp], then use the following query:
curl 'http://prometheus:9090/api/v1/query?query={instance="my-target"}[d]&time=end_timestamp'
See these docs for details on how to read raw samples from Prometheus.
If you need to obtain all the metrics / series from all the targets, then just use the following series selector: {__name__!=""}
See also /api/v1/query_range - this endpoint is used by Grafana for building graphs from Prometheus data.
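As a sketch of how these calls look outside of curl, the series request above can be assembled with Python's standard library (the server address is a placeholder, and the request itself is not sent here; only the URL construction is shown, including the percent-encoding of the selector):

```python
from urllib.parse import urlencode

def series_url(base, selector):
    # Build the /api/v1/series URL for a given series selector.
    # urlencode percent-encodes the selector, including {, } and ".
    return "%s/api/v1/series?%s" % (base, urlencode({"match[]": selector}))

url = series_url("http://prometheus:9090", '{instance="my-target"}')
print(url)
```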
So, using Datadog is proving to be quite difficult when attempting to parse some of the data we have.
For example, imagine we have a log that has the properties
Type    Count
Cat     3
Dog     4
How do we get a Datadog dashboard to appreciate this data? I have full control over how the data is added to the log, but I don't know how to iterate over it or parse it in a meaningful manner.
For example, it may come out like:
Animals:[{"Cat":3},{"Dog":4}]
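Since the log format is under your control, one option is to emit one entry per type rather than a nested list, so each type/count pair becomes its own parseable log line. A quick sketch of flattening the shape above (the animal_count line format is made up for illustration):

```python
import json

payload = 'Animals:[{"Cat":3},{"Dog":4}]'

# Strip the "Animals:" prefix and parse the JSON list.
entries = json.loads(payload.split(":", 1)[1])

# Flatten [{"Cat": 3}, {"Dog": 4}] into (type, count) pairs,
# then emit one log line per pair.
pairs = [(k, v) for entry in entries for k, v in entry.items()]
for animal, count in pairs:
    print("animal_count type=%s count=%d" % (animal, count))
```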
I have four singlestat panels which show my used space on different hosts (every host has also different type_instances):
The query for one of this singlestats is the following:
Question: Is there a way to create a fifth singlestat panel which shows the sum of the other 4 singlestats? (The sum of all "storj_value" where type=shared)
The InfluxDB query language does not currently support aggregations across measurements (e.g. JOINs). It is possible with Kapacitor, but that requires writing code so that new aggregated values for all the measurements are written to the DB, and those then need to be queried separately.
The only option currently is to use an API that does have cross-metric function support, for example Graphite with an InfluxDB storage back-end, InfluxGraph.
The two APIs are quite different - Influx's is query-language based, Graphite's is not - and tagged InfluxDB data will need to be mapped to a Graphite metric path via templates; see the configuration examples.
After that, Graphite functions that act across series can be used, in particular for the above question, sumSeries.
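For instance, once the InfluxDB data is exposed under a Graphite metric path, the fifth panel's query could look something like this (the storj.*.shared.value path is hypothetical and depends entirely on the template configuration):

```
sumSeries(storj.*.shared.value)
```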
Hi, after performing a group-by-key on a KV PCollection, I need to:
1) Make every element in that PCollection a separate individual PCollection.
2) Insert the records in those individual PCollections into a BigQuery Table.
Basically my intention is to create a dynamic date partition in the BigQuery table.
How can I do this?
An example would really help.
For Google Dataflow to be able to perform the massive parallelisation which makes it one of its kind (as a service on the public cloud), the job flow needs to be predefined before submitting it to the Google Cloud console. Every time you execute the jar file that contains your pipeline code (which includes pipeline options and the transforms), a JSON file with the description of the job is created and submitted to the Google Cloud platform. The managed service then uses this to execute your job.
The use case mentioned in the question demands that the input PCollection be split into as many PCollections as there are unique dates. For the split, the tuple tags needed to split the collection would have to be created dynamically, which is not possible at this time. Creating tuple tags dynamically is not allowed because that doesn't help in creating the job-description JSON file, and it defeats the whole design/purpose with which Dataflow was built.
I can think of a couple of solutions to this problem (both having its own pros and cons) :
Solution 1 (a workaround for the exact use case in the question):
Write a Dataflow transform that takes the input PCollection and, for each element in the input:
1. Checks the date of the element.
2. Appends the date to a pre-defined BigQuery table name as a decorator (in the format yyyyMMdd).
3. Makes an HTTP request to the BQ API to insert the row into the table with the decorated name.
You will have to take the cost perspective into consideration in this approach, because there is a single HTTP request for every element rather than the single BQ load job that the BigQueryIO Dataflow SDK module would have used.
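Steps 1-2 amount to building a decorated table name from each element's date; a minimal Python sketch (the table name here is a placeholder):

```python
from datetime import date

def decorated_table(base_table, element_date):
    # BigQuery table decorators take the form table$YYYYMMDD,
    # addressing the partition for that day.
    return "%s$%s" % (base_table, element_date.strftime("%Y%m%d"))

print(decorated_table("mydataset.events", date(2017, 5, 3)))
```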
Solution 2 (best practice that should be followed in these type of use cases):
1. Run the Dataflow pipeline in streaming mode instead of batch mode.
2. Define a time window with whatever duration is suitable to the scenario in which it is being used.
3. For the `PCollection` in each window, write it to a BQ table with the decorator being the date of the time window itself.
You will have to consider rearchitecting your data source to send data to Dataflow in real time, but you will have a dynamically date-partitioned BigQuery table with the results of your data processing available in near real time.
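The mapping in step 3 - from a window to its decorator date - can be sketched like this (assuming fixed one-hour windows keyed by their start time, in UTC):

```python
from datetime import datetime, timezone

def window_decorator_date(event_ts, window_seconds=3600):
    # Align the event timestamp (Unix seconds) to the start of its
    # fixed window, then format that instant as a YYYYMMDD suffix.
    window_start = event_ts - (event_ts % window_seconds)
    return datetime.fromtimestamp(window_start, tz=timezone.utc).strftime("%Y%m%d")

print(window_decorator_date(1493835202))  # 2017-05-03T18:13:22Z
```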
References -
Google Big Query Table Decorators
Google Big Query Table insert using HTTP POST request
How job description files work
Note: Please mention it in the comments and I will elaborate the answer with code snippets if needed.
I'm looking into grouping elements during the flow into batches of a fixed size.
In pseudocode:
PCollection[String].apply(Grouped.size(10))
Basically, converting a PCollection[String] into a PCollection[List[String]] where each list now contains 10 elements. As this is a batch job, in case the count doesn't divide evenly, the last batch would contain the leftover elements.
I have two ugly ideas involving windows with fake timestamps, or a GroupBy using keys based on a random index to distribute evenly, but this seems like too complex a solution for such a simple problem.
This question is similar to a variety of questions on how to batch elements. Take a look at these to get you started:
Can datastore input in google dataflow pipeline be processed in a batch of N entries at a time?
Partition data coming from CSV so I can process larger patches rather then individual lines
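Independent of Beam's APIs, the grouping itself is straightforward; a plain-Python sketch of the batching semantics described above (fixed-size groups, with a smaller final group for the leftovers):

```python
def batched(elements, size):
    # Yield successive lists of `size` elements; the final list
    # holds whatever is left when the count doesn't divide evenly.
    batch = []
    for element in elements:
        batch.append(element)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

print(list(batched(["a", "b", "c", "d", "e"], 2)))
# → [['a', 'b'], ['c', 'd'], ['e']]
```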