I'm trying to get a deeper view into my Dataflow jobs by measuring parts of them using Metrics.counter & Metrics.gauge, but I cannot find those metrics on Stackdriver.
I have a premium Stackdriver account and I can see those counters under the Custom Counters section on the Dataflow UI.
I can see the droppedDueToLateness 'custom' counter on Stackdriver, though, and it seems to be created via Metrics.counter as well...
Aside from that, something that might be helpful: when I navigate to https://app.google.stackdriver.com/services/dataflow the message I get is this:
"You do not have any resources of this type being monitored by Stackdriver." That's weird as well, as if our Cloud Dataflow weren't properly connected to Stackdriver. On the other hand, some metrics are displayed and can be monitored, such as System Lag, Watermark age, Elapsed time, Element count, etc...
What am I missing?
Regards
Custom metric naming conventions
When defining custom metrics in Dataflow, you have to adhere to the custom metric naming conventions, or they won't show up in Stackdriver.
Relevant snippet:
You must adhere to the following spelling rules for metric label names:
- You can use upper and lower-case letters, digits, and underscores (_) in the names.
- You can start names with a letter or digit.
- The maximum length of a metric label name is 100 characters.
If you create a metric with
Metrics.counter('namespace', 'name')
The metric shows up in Stackdriver as custom.googleapis.com/dataflow/name, so 'name' should adhere to the rules mentioned above. The namespace does not seem to be used by Stackdriver.
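For illustration, a minimal sketch in the Beam Python SDK, with a hypothetical namespace and counter name that follow those rules:

import apache_beam as beam
from apache_beam.metrics import Metrics

class CountValidRecords(beam.DoFn):
    def __init__(self):
        # 'valid_records' uses only letters, digits and underscores,
        # starts with a letter, and is well under 100 characters, so it
        # should surface as custom.googleapis.com/dataflow/valid_records.
        self.valid_records = Metrics.counter('my_namespace', 'valid_records')

    def process(self, element):
        self.valid_records.inc()
        yield element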
Additional: labels
It doesn't seem possible to add labels to the metrics when defined this way. However, the full description of each time series of a metric is a string with the format
'name' job_name job_id transform
So you can aggregate by these 4 properties (+ region and project).
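To pull those time series back out of Stackdriver programmatically, a rough sketch (not part of the original answer; 'my-project' and 'valid_records' are placeholders) using the google-cloud-monitoring Python client:

import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {'end_time': {'seconds': now}, 'start_time': {'seconds': now - 3600}}
)

# List the last hour of the custom Dataflow metric named 'valid_records'.
results = client.list_time_series(
    request={
        'name': 'projects/my-project',
        'filter': 'metric.type = "custom.googleapis.com/dataflow/valid_records"',
        'interval': interval,
        'view': monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for series in results:
    # The series description encodes name, job_name, job_id and transform.
    print(series.metric.labels, series.resource.labels)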
Related
I'm trying to figure out the best, or at least a reasonable, approach to defining alerts in InfluxDB. For example, I might use the CPU batch TICKscript that comes with Telegraf. This could be set up as a global monitor/alert for all hosts being monitored by Telegraf.
What is the approach when you want to deviate from the above setup for one host, i.e. instead of X% for a specific server we want to alert on Y%?
I accept that a distinct TICKscript could be created for the custom values, but how do I go about excluding the host from the original 'global' one?
This is a simple scenario, but this needs to meet the needs of 10,000 hosts, of which there will be hundreds of exceptions, and it will also encompass tens or hundreds of global alert definitions.
I'm struggling to see how you could use the platform as the primary source of monitoring/alerting.
As said in the comments, you can use the sideload node to achieve that.
Say you want to ensure that your InfluxDB servers are not overloaded, and you allow 100 measurements by default. Only on one server, which happens to receive a massive number of datapoints, do you want to limit this to 10 (a value that the _internal database easily exceeds, but good enough for our example).
Given the following excerpt from a TICKscript:
var data = stream
    |from()
        .database(db)
        .retentionPolicy(rp)
        .measurement(measurement)
        .groupBy(groupBy)
        .where(whereFilter)
    // Expose the numMeasurements field under the name 'value'.
    |eval(lambda: "numMeasurements")
        .as('value')

var customized = data
    |sideload()
        // Directory containing the per-host override files.
        .source('file:///etc/kapacitor/customizations/demo/')
        // The file to consult for each point, based on its hostname tag.
        .order('hosts/host-{{.hostname}}.yaml')
        // Default used when no override file sets maxNumMeasurements.
        .field('maxNumMeasurements', 100)
    |log()

var trigger = customized
    |alert()
        // "maxNumMeasurements" is 100, or the per-host override if one exists.
        .crit(lambda: "value" > "maxNumMeasurements")
and assuming the server with the exception is named influxdb and the file /etc/kapacitor/customizations/demo/hosts/host-influxdb.yaml looks as follows
maxNumMeasurements: 10
a critical alert will be triggered if value (and hence numMeasurements) exceeds 10 and the hostname tag equals influxdb, or if value exceeds 100 on any other host.
There is an example in the documentation that handles scheduled downtimes using sideload.
Furthermore, I have created an example available on GitHub using docker-compose.
Note that there is a caveat with the example: the alert flaps because of a second, dynamically generated database. But it should be sufficient to show how to approach the problem.
What is the cost of using sideload nodes in terms of performance and computation if you have over 10 thousand servers?
Managing alerts manually and directly in Chronograf/Kapacitor is not feasible for a large number of custom alerts.
At AMMP Technologies we need to manage alerts per database, customer, and customer object, and the number can go into the thousands. We've opted for a custom solution in which we keep a standard set of template TICKscripts (not to be confused with Kapacitor templates) and provide an interface to the user that exposes only the relevant variables. A service (written in Python) then combines the values of those variables with a TICKscript and deploys (updates, or deletes) the task on the Kapacitor server via the Kapacitor API. This is automated, so that data for new customers/objects is combined with the templates and automatically deployed to Kapacitor.
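A minimal sketch of that deployment step (not our actual code; the template file, task id, and placeholder variable are made up) using Kapacitor's HTTP task API:

import requests

KAPACITOR_URL = 'http://localhost:9092'  # assumed Kapacitor address

def deploy_task(task_id, script, db, rp):
    # Create or update a Kapacitor task via the REST API.
    task = {
        'id': task_id,
        'type': 'stream',
        'dbrps': [{'db': db, 'rp': rp}],
        'script': script,
        'status': 'enabled',
    }
    # Try to update an existing task first; create it if it does not exist.
    resp = requests.patch(f'{KAPACITOR_URL}/kapacitor/v1/tasks/{task_id}', json=task)
    if resp.status_code == 404:
        resp = requests.post(f'{KAPACITOR_URL}/kapacitor/v1/tasks', json=task)
    resp.raise_for_status()

# Fill a per-customer threshold into a template TICKscript and deploy it.
template = open('cpu_alert.tick.tpl').read()  # hypothetical template file
script = template.replace('{{THRESHOLD}}', '90')
deploy_task('cpu_alert_customer_42', script, 'telegraf', 'autogen')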
You obviously need to design your tasks to be specific enough so that they don't overlap and generic enough so that it's not too much work to create tasks for every little thing.
I have the following use case:
We have a system that computes different response time metrics for messages that we want to insert in InfluxDB. This system writes JSON entries to a file.
We use telegraf with JSON plugin to extract the fields we want and insert into InfluxDB.
So far so good.
But we have an issue with one particular piece of information.
The system emits messages where mId is the unique identifier; in the examples below we have two, uuidXXXX and uuidYYYY:
{"meta1":"value", "mId":"uuidXXXX", "resTime1":1232332233, "timeWeEnterBus":startTimestamp}
{"meta1":"value2", "mId":"uuidYYYY", "resTime1":1232331111, "timeWeEnterBus":startTimestamp}
{"meta1":"value", "mId":"uuidXXXX", "resTime1":1232332233, "timeWeExitBus":endTimestamp}
{"meta1":"value2", "mId":"uuidYYYY", "resTime1":1232331111, "timeWeExitBus":endTimestamp}
And what we want here is to graph the timeInBus, which is equal to timeWeExitBus - timeWeEnterBus, for each unique mId.
So my questions are:
In my understanding, uuid would be a field, not a tag, as its cardinality is unbounded; the same goes for timeWeExitBus and timeWeEnterBus, which would be numeric fields since we want to use functions on them. And timeInBus would be the measurement. Am I right?
Is this a good use case for InfluxDB/Telegraf, or are we misusing them? In my understanding it doesn't look like a good idea to compute this on the Telegraf side, but I don't see how to do it in InfluxDB either; I initially thought the ELAPSED function could help, but I ended up concluding it doesn't work here.
If it is a good use case, could you point me to documentation that would help implement this?
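One possible approach, sketched here purely as an illustration (the file name, database, and pairing script are assumptions, not from the thread): pair the enter/exit events per mId before ingestion and write the computed timeInBus to InfluxDB with the Python client:

import json
from influxdb import InfluxDBClient  # pip install influxdb

client = InfluxDBClient(host='localhost', port=8086, database='metrics')

# mId -> timeWeEnterBus, waiting for the matching exit event
enter_times = {}

def handle_event(event):
    mid = event['mId']
    if 'timeWeEnterBus' in event:
        enter_times[mid] = event['timeWeEnterBus']
    elif 'timeWeExitBus' in event and mid in enter_times:
        # Both halves seen: write the duration as its own measurement.
        client.write_points([{
            'measurement': 'timeInBus',
            'fields': {
                'value': event['timeWeExitBus'] - enter_times.pop(mid),
                'mId': mid,  # kept as a field: unbounded cardinality
            },
        }])

with open('messages.json') as f:  # the file the system writes to
    for line in f:
        handle_event(json.loads(line))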
I have data from two releases in different time intervals, but I want to plot these two releases in Grafana over the same interval. Is it possible to fake the time interval and plot the graph? By default the x-axis is a time series, so I can't use other parameters.
Please advise.
I've created graphs like this, although it took a bit of work.
To start out, InfluxDB doesn't support timeShift yet as detailed here:
https://github.com/influxdata/influxdb/issues/142
So I used a separate HTTP server called the influxdb-timeshift proxy:
https://github.com/maxsivanov/influxdb-timeshift-proxy
My stack looked like this:
Grafana Dashboard --> influxdb-timeshift-proxy --> InfluxDB
Here are descriptions of the two "-->" in the above schematic:
The --> on the left: I created a Grafana datasource pointing to the TCP port of the influxdb-timeshift-proxy.
The --> on the right: the influxdb-timeshift-proxy startup configuration points to the InfluxDB server.
With this in place, to get the time shifting to happen, the SQL-like statements sent to InfluxDB need a carefully formatted field alias like this:
"SELECT mean( "meanAT" ) AS shift_855296_seconds" blah blah sql blah.
See the influxdb-timeshift-proxy github page above for syntax details.
With a Grafana dashboard, to get two lines (aka series) on a time-series graph, I configure two SQL statements. The above represents one SELECT from a test 9-10 days ago; then I'd SELECT a different test (my baseline that I ran today) with a timeshift of 0:
"SELECT mean( "meanAT" ) AS shift_0_seconds" blah blah sql blah.
So that answers your question, but it is of limited use -- because some poor human has got to calculate the difference between the test times and then dial the result (shift_855296_seconds) into the SQL in the dashboard.
Why? Because out of the box, Grafana dashboards execute SQL statements that are (mostly) hard-coded into a dashboard.
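Computing that alias can at least be scripted; here is a small sketch (my own illustration, with made-up test start times; see the proxy's README for the exact alias syntax) that produces the shift_<n>_seconds alias the proxy expects:

from datetime import datetime

def shift_alias(baseline_start, test_start):
    # Build the shift_<n>_seconds alias used by influxdb-timeshift-proxy.
    seconds = int((baseline_start - test_start).total_seconds())
    return 'shift_%d_seconds' % seconds

baseline = datetime(2017, 3, 17, 12, 0, 0)  # hypothetical baseline run, today
old_test = datetime(2017, 3, 7, 14, 25, 4)  # hypothetical run ~10 days earlier
print('SELECT mean("meanAT") AS %s ...' % shift_alias(baseline, old_test))
# -> SELECT mean("meanAT") AS shift_855296_seconds ...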
To get Grafana to execute SQL where the shift alias is dynamically generated, I wrote a Grafana scripted dashboard in JavaScript. Here are the high-level instructions for scripted dashboards:
http://docs.grafana.org/reference/scripting/
FYI, Grafana scripted dashboards are poorly documented, and the 'development environment' for debugging is primitive at best; I was unable to get the JavaScript 'require' mechanism (which pulls in 3rd-party libraries) to work. But there is limited help on the Grafana discussion board, and it does actually work -- creating a very nice time-shifting dashboard on the fly is possible.
The HTTP URLs that launch/display the scripted dashboard can easily be embedded in some other dashboard you create. Just add your scripted dashboard URL to a "Text Panel" using markdown:
http://docs.grafana.org/features/panels/text/
Ultimately, the influxdb-timeshift-proxy is a stop-gap solution.
I have not tried it, but it looks like Kapacitor can also be used to provide the timeshifting, as described here:
https://docs.influxdata.com/kapacitor/v1.3/nodes/shift_node/
--Erik
Do you mean the X-Axis Mode option on the Graph panel?
Not sure if I understand your question correctly.
If you want to just mark your release - you can use Annotations - http://docs.grafana.org/reference/annotations/#influxdb-annotations.
If you want to show dashboard only for specific timeframe - you can encode that in URL with 'from' and 'to' parameters - https://[your-dashboard-url]?from=1488868669245&to=1488873058626
But yes, there is currently no way to put a parameter on the X-axis in Grafana.
In SPSS, when defining the measure of a variable, the usual options are "Scale", "Ordinal", and "Nominal" (see image).
However, when using the actual dialog boxes to run analyses, SPSS will often ask us to describe whether the data are "Continuous" or "Categorical". E.g., I was watching this video by James Gaskin (a great YouTube teacher, by the way) and saw this dialog box (image below).
My Question: In the second image, you can see that the narrator put some "Ordinal" variables in the "Continuous" box. Is it okay to do that? How come?
For most procedures, the treatment of a variable is determined by how you use it. The measurement level is just a reminder, so you can treat a variable however it makes sense.
There are some procedures that automatically determine how to treat a variable based on the measurement level, including CTABLES, the Chart Builder, and TREE, but you can change the level temporarily in the dialog box or in syntax, or change it persistently via VARIABLE LEVEL (e.g., VARIABLE LEVEL income (SCALE).) or in the Data Editor. Also, most of the statistical extension commands use the declared measurement level to determine whether a variable is continuous or a factor.
I have a customized sink extending FileBasedSink, to which I write by calling PCollection.apply(Write.to(MySink)) in Dataflow (very similar to XmlSink.java). However, it seems that by default simply calling Write.to will always result in 3 output shards? Is there any way I could define the number of output shards (like TextIO.Write.withNumShards) just in the customized sink class definition, or do I have to define another customized PTransform like TextIO.Write?
Unfortunately, right now FileBasedSink does not support specifying the number of shards.
In practice, the number of shards you get will depend on how the framework chooses to optimize the parts of the pipeline producing the collection you're writing, so there's essentially no control over that.
I've filed a JIRA issue for your request so you can subscribe to the status.