now ,I get the value from Zabbix Server.The value include IOPS,IO Response Time,and so on.But I want to monitor whether the value is in the range allowed.How to do it.
demo:
items [A]
name Avg Disk sec/Read Response Time
key perf_counter[\234(_Total)\208]
now new items [B] monitor items[A]'s Value, if Value < 12 ms items[B]'s Value is Good,else items[B]'s Value is bad.
In a word,i only want to monitor the disk io response time is healthy or not.
update:
items[A] : monitor windows disk
items[B] : another items is uesed to monitor items[A]'s Value
I want to show io health in charts,instead of setting the trigger
I didn't quite get what's item "[B]", but in general, thresholds (or more complicated problem conditions) in Zabbix are specified using triggers.
For example, in your case the trigger expression could be {host:perf_counter[\234(_Total)\208].last()}>12. That is assuming your item holds data in ms. If other units are used, adjust the threshold accordingly.
This example uses the most simple trigger function, last(). See the trigger function list in the Zabbix manual for an idea of other, more fancy trigger expressions you could construct.
Related
If I send a gauge to Prometheus then the payload has a timestamp and a value like:
metric_name {label="value"} 2.0 16239938546837
If I query it on Prometheus I can see a continous line. Without sending a payload for the same metric the line stops. Sending the same metric after some minutes I get another continous line, but it is not connected with the old line.
Is this fixed in Prometheus how long a timeseries last without getting an update?
I think the first answer by Marc is in a different context.
Any timeseries in prometheus goes stale in 5m by default if the collection stops - https://www.robustperception.io/staleness-and-promql. In other words, the line stops on graph (or grafana).
So if you resume the metrics collection again within 5 minutes, then it will connect the line by default. But if there is no collection for more than 5 minutes then it will show a disconnect on the graph. You can tweak that on Grafana to ignore drops but that not ideal in some cases as you do want to see when the collection stopped instead of giving the false impression that there was continuous collection. Alternatively, you can avoid the disconnect using some functions like avg_over_time(metric_name[10m]) as needed.
There is two questions here :
1. How long does prometheus keeps the data ?
This depends on the configuration you have for your storage. By default, on local storage, prometheus have a retention of 15days. You can find out more in the documentation. You can also change this value with this option : --storage.tsdb.retention.time
2. When will I have a "hole" in my graph ?
The line you see on a graph is made by joining each point from each scrape. Those scrape are done regularly based on the scrape_interval value you have in your scrape_config. So basically, if you have no data during one scrape, then you'll have a hole.
So there is no definitive answer, this depends essentially on your scrape_interval.
Note that if you're using a function that evaluate metrics for a certain amount of time, then missing one scrape will not alter your graph. For example, using a rate[5m] will not alter your graph if you scrape every 1m (as you'll have 4 other samples to do the rate).
I have an UnboundedSource that generates N items (it's not in batch mode, it's a stream -- one that only generates a certain amount of items and then stops emitting new items but a stream nonetheless). Then I apply a certain PTransform to the collection I'm getting from that source. I also apply the Window.into(FixedWindows.of(...)) transform and then group the results by window using Combine. So it's kind of like this:
pipeline.apply(Read.from(new SomeUnboundedSource(...)) // extends UnboundedSource
.apply(Window.into(FixedWindows.of(Duration.millis(5000))))
.apply(new SomeTransform())
.apply(Combine.globally(new SomeCombineFn()).withoutDefaults())
And I assumed that would mean new events are generated for 5 seconds, then SomeTransform is applied to the data in the 5 seconds window, then a new set of data is polled and therefore generated. Instead all N events are generated first, and only after that is SomeTransform applied to the data (but the windowing works as expected). Is it supposed to work like this? Does Beam and/or the runner (I'm using the Flink runner but the Direct runner seems to exhibit the same behavior) have some sort of queue where it stores items before passing it on to the next operator? Does that depend on what kind of UnboundedSource is used? In my case it's a generator of sorts. Is there a way to achieve the behavior that I expected or is it unreasonable? I am very new to working with streaming pipelines in general, let alone Beam. I assume, however, it would be somewhat illogical to try to read everything from the source first, seeing as it's, you know, unbounded.
An important thing to note is that windows in Beam operate on event time, not processing time. Adding 5 second windows to your data is not a way to prescribe how the data should be processed, only the end result of aggregations for that processing. Further, windows only affect the data once an aggregation is reached, like your Combine.globally. Until that point in your pipeline the windowing you applied has no effect.
As to whether it is supposed to work that way, the beam model doesn't specify any specific processing behavior so other runners may process elements slightly differently. However, this is still a correct implementation. It isn't trying to read everything from the source; generally streaming sources in Beam will attempt to read all elements available before moving on and coming back to the source later. If you were to adjust your stream to stream in elements slowly over a long period of time you will likely see more processing in between reading from the source.
I'm trying to figure out the best or a reasonable approach to defining alerts in InfluxDB. For example, I might use the CPU batch tickscript that comes with telegraf. This could be setup as a global monitor/alert for all hosts being monitored by telegraf.
What is the approach when you want to deviate from the above setup for a host, ie instead of X% for a specific server we want to alert on Y%?
I'm happy that a distinct tickscript could be created for the custom values but how do I go about excluding the host from the original 'global' one?
This is a simple scenario but this needs to meet the needs of 10,000 hosts of which there will be 100s of exceptions and this will also encompass 10s/100s of global alert definitions.
I'm struggling to see how you could use the platform as the primary source of monitoring/alerting.
As said in the comments, you can use the sideload node to achieve that.
Say you want to ensure that your InfluxDB servers are not overloaded. You may want to allow 100 measurements by default. Only on one server, which happens to get a massive number of datapoints, you want to limit it to 10 (a value which is exceeded by the _internal database easily, but good for our example).
Given the following excerpt from a tick script
var data = stream
|from()
.database(db)
.retentionPolicy(rp)
.measurement(measurement)
.groupBy(groupBy)
.where(whereFilter)
|eval(lambda: "numMeasurements")
.as('value')
var customized = data
|sideload()
.source('file:///etc/kapacitor/customizations/demo/')
.order('hosts/host-{{.hostname}}.yaml')
.field('maxNumMeasurements',100)
|log()
var trigger = customized
|alert()
.crit(lambda: "value" > "maxNumMeasurements")
and the name of the server with the exception being influxdb and the file /etc/kapacitor/customizations/demo/hosts/host-influxdb.yaml looking as follows
maxNumMeasurements: 10
A critical alert will be triggered if value and hence numMeasurements will exceed 10 AND the hostname tag equals influxdb OR if value exceeds 100.
There is an example in the documentation handling scheduled downtimes using sideload
Furthermore, I have created an example available on github using docker-compose
Note that there is a caveat with the example: The alert flaps because of a second database dynamically generated. But it should be sufficient to show how to approach the problem.
What is the cost of using sideload nodes in terms of performance and computation if you have over 10 thousand servers?
Managing alerts manually directly in Chronograph/Kapacitor is not feasible for big number of custom alerts.
At AMMP Technologies we need to manage alerts per database, customer, customer_objects. The number can go into the 1000s. We've opted for a custom solution where keep a standard set of template tickscripts (not to be confused with Kapacitor templates), and we provide an interface to the user where only expose relevant variables. After that a service (written in python) combines the values for those variables with a tickscript and using the Kapacitor API deploys (updates, or deletes) the task on the Kapacitor server. This is then automated so that data for new customers/objects is combined with the templates and automatically deployed to Kapacitor.
You obviously need to design your tasks to be specific enough so that they don't overlap and generic enough so that it's not too much work to create tasks for every little thing.
I am trying to backprocess data in Kafka topics using a Kafka Streams application that involves a join. One of the streams to be joined has much larger volume of data per unit of time in the corresponding topic. I would like to control the consumption from the individual topics so that I get roughly the same event timestamps from each topic in a single consumer.poll(). However, there doesn't appear to be any way to control the behavior of the KafkaConsumer backing the source stream. Is there any way around this? Any insight would be appreciated.
Currently Kafka cannot control the rate limit on both Producers and Consumers.
Refer:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-13+-+Quotas
But if you are using Apache Spark as the stream processing platform, you can limit the input rate for the Kafka receivers.
in the consumer side you can use consume([num_messages=1][, timeout=-1])
function instead of poll.
consume([num_messages=1][, timeout=-1]):
Consumes a list of messages (possibly empty on timeout). Callbacks may be executed as a side effect of calling this method.
The application must check the returned Message object’s Message.error() method to distinguish between proper messages (error() returns None) and errors for each Message in the list (see error().code() for specifics). If the enable.partition.eof configuration property is set to True, partition EOF events will also be exposed as Messages with error().code() set to _PARTITION_EOF.
num_messages (int) – The maximum number of messages to return (default: 1).
timeout (float) – The maximum time to block waiting for message, event or callback (default: infinite (-1)). (Seconds)
It is really important for my application to always emit a "window finished" message, even if the window was empty. I cannot figure out how to do this. My initial idea was to output an int for each record processed and use Sum.integersGlobally and then emit a record based off that, giving me a singleton per window, I could then simply emit one summary record per window, with 0 if the window was empty. Of course, this fails, and you have to use withoutDefaults which will then emit nothing if the window was empty.
Cloud Dataflow is built around the notion of processing data that is likely to be highly sparse. By design, it does not conjure up data to fill in those gaps of sparseness, since this will be cost prohibitive for many cases. For a use case like yours where non-sparsity is practical (creating non-sparse results for a single global key), the workaround is to join your main PCollection with a heartbeat PCollection consisting of empty values. So for the example of Sum.integersGlobally, you would Flatten your main PCollection<Integer> with a secondary PCollection<Integer> that contains exactly one value of zero per window. This assumes you're using an enumerable type of window (e.g. FixedWindows or SlidingWindows; Sessions are by definition non-enumerable).
Currently, the only way to do this would be to write a data generator program that injects the necessary stream of zeroes into Pub/Sub with timestamps appropriate for the type of windows you will be using. If you write to the same Pub/Sub topic as your main input, you won't even need to add a Flatten to your code. The downside is that you have to run this as a separate job somewhere.
In the future (once our Custom Source API is available), we should be able to provide a PSource that accepts an enumerable WindowFn plus a default value and generates an appropriate unbounded PCollection.