Zabbix LLD Prometheus format metrics - parsing

I am trying to parse some Prometheus format metrics by Zabbix discovery rule.
I have metrics like that:
asg_instance_metadata{id="i-***", tag="ASG_DESIRED_NUM"} 3
asg_instance_metadata{id="i-***", tag="ASG_MAX_SIZE"} 10
asg_instance_metadata{id="i-***", tag="ASG_MIN_SIZE"} 3
asg_instance_metadata{id="i-***", tag="alpha.eksctl.io/nodegroup-type"} unmanaged
asg_instance_metadata{id="i-***", tag="aws:ec2launchtemplate:id"} lt-***
I use "Prometheus to JSON" preprocessing for discovery rule.
In case of first three lines - it is parsing and converting to JSON without any problems, but next metrics can't be parsed, i am getting next error:
cannot convert Prometheus data to JSON: data parsing error at row 4
"asg_instance_metadata{id="i-***", ta...": cannot parse
metric value
In these Zabbix docs
Link1
Link2
I see examples only with INT values.
So can someone help me? Is it possible to use string values in prometheuse format metrics?
My discovery rule configuration:

Related

How to use FluentD as a buffer between Telegraf and InfluxDB

Is there any way to ship metrics gathered form Telegraf to FluentD, then into InfluxDB?
I know it's possible to write data from FluentD into InfluxDB; but how does one ship data from Telegraf into FluentD, basically using use FluentD as a buffer (as opposed to using Kafka or Redis)?
While it might be possible to do with FluentD using some of the available, although outdated output plugins, such as InfluxDB-Metrics, I couldn't get the plugin to work properly and it hasn't been updated in over six years, so it will probably not work with newer releases of FluentD.
Fluent Bit however, has an Influxdb output built right into it, so I was able to get it to work with that. The caveat is that it has no Telegraf plugin. So the solution I found was to setup a tcp input plugin in Fluent Bit, and set Telegraf to write JSON formatted data to it in it's output section.
The caveat of doing this, is that the JSON data is nested and not formatted properly for InfluxDB. The workaround is to use nest filters in Fluent Bit to 'lift' the nested data format, and re-format properly for InfluxDB.
Below is an example for disk-space, which is not a metric that is natively supported with Fluent Bit metrics but is natively supported with Telegraf:
#SET me=${HOST_HOSTNAME}
[INPUT] ## tcp recipe ## Collect data from telegraf
Name tcp
Listen 0.0.0.0
Port 5170
Tag telegraf.${me}
Chunk_Size 32
Buffer_Size 64
Format json
[FILTER] ## rename the three tags sent from Telegraf to prevent duplicates
Name modify
Match telegraf.*
Condition Key_Value_Equals name disk
Rename fields fieldsDisk
Rename name nameDisk
Rename tags tagsDisk
[FILTER] ## un-nest nested JSON formatted info under 'field' tag
Name nest
Match telegraf.*
Operation lift
Nested_under fieldsDisk
Add_prefix disk.
[FILTER] ## un-nest nested JSON formatted info under 'disk' tag
Name nest
Match telegraf.*
Operation lift
Nested_under tagsDisk
Add_prefix disk.
[OUTPUT] ## output properly formatted JSON info
Name influxdb
Match telegraf.*
Host influxdb.server.com
Port 8086
#HTTP_User whatever
#HTTP_Passwd whatever
Database telegraf.${me}
Sequence_Tag point_in_time
Auto_Tags On
NOTE: This is just a simple awkward config for my own proof of concept

Graphite has null values between data points

I have an API that fetches data packets from different servers. It formats this data to different small JSON units. I wrote an algorithm that sends them to graphite with the command json2graphite.
The sending works very well, the incoming data doesn't look bad either.
Now the problem:
The data displayed in graphite shows that each entry is followed by a null.
The data points that should be connected
I am aware that this data can also be connected using a function provided by the Graphite interface, but this doesn't help because Grafana boards always jump back and forth between value and null.
Is there a way to tell Grafana that it only goes to null if there was no data for more than 1 min or so?
I already tried to fix the problem with the data from "storage-schemas.conf" and "storage-aggregation.conf". Unfortunately without success.
storage-schemas.conf:
[default_1min_for_1day]
pattern = .*
retentions = 10s:6h,30s:8d,1m:31d,10m:1y,1h:5y
aggregation.conf:
[default_average]
pattern = .*
xFilesFactor = 0
aggregationMethod = average
If you want to know any more, ask me. : )
Grafana has an option to connect datapoints that are separated by nulls. You can see how to enable this in the screenshot shown under Display Styles settings on Grafana's documentation.
In Graphite composer you can also do it by specifying the connected line mode under Graph options here:
Additionally, you could use Graphite's keepLastValue function to carry the last received value over gaps where there are nulls.
I haven't found a direct solution but I will now try to minimize the interval between the entries. I noticed that the requests take much too long: 2-5 minutes.
There are probably too many servers, so the requests block the port too long.
The problem is not solved yet but I think I will mark it as solved if nobody says I have the problem within 5 days.

Can Telegraf combine/add value of metrics that are per-node, say for a cluster?

Let's say I have some software running on a VM that is emitting two metrics that are fed through Telegraf to be written into InfluxDB. Let's say the metric are no. successfully handled HTTP requests (S), and no. of failed HTTP requests (F), on that VM. However, I might configure three such VMs each emitting those 2 metrics.
Now, if I would like to have a computed metric which is the sum of S from each VM, and sum of F from each VM, and store as new metrics, at various instants of time. Is this something that can be achieved using Telegraf ? Or is there a better, more efficient, more elegant way ?
Kindly note that my knowledge of Telegraf and InfluxDB are theoretical, as I've recently started reading up about them, so I have not actually tried any of the above, yet.
This isn't something telegraf would be responsible for.
With Influx 1.x, you'd use a TICKScript or Continuous Queries to calculate the sum and inject the new sampled value.
Roughly, this would look like:
CREATE CONTINUOUS QUERY "sum_sample_daily" ON "database"
BEGIN
SELECT sum("*") INTO "daily_measurement" FROM "measurement" GROUP BY time(1d)
END
CQ docs

How to transform "Tags Values" in Telegraf

How can I transform the Tag Values in Telegraf?
I am trying to import Web access logs into InfluxDB with Telegraf. However, some of the URL PATHs include identifiers (session IDs, product IDs, etc).
I need to search and aggregate per path type (ids excluded), therefore, I can't(?) have them vary like that.
In the input plugin "logparser" I can use a grok extraction pattern but I can't do transformations of the values extracted that I know of.
And the only processor plugin (in between Input and Output) is merely a "printer".
I can't find any clean way of doing this with Telegraf. Maybe I could do some gymmics with Telegraf (multiple Grok parsers + ex/inclusions?) but after some quite extensive attempts I didn't manage to make anything work - it appeared quite fiddly.
This is only half an answer but:
I managed to achieve what I was trying with LogStash instead, outputting to InfluxDB (LogStash has its own output plugin to InfluxDB). Not as desirable, since now I'm having to run both Telegraf + LogStash but it's working.
I've created a feature request on Telegraf's GitHub:
https://github.com/influxdata/telegraf/issues/2667

Reading large gzip JSON files from Google Cloud Storage via Dataflow into BigQuery

I am trying to read about 90 gzipped JSON logfiles from Google Cloud Storage (GCS), each about 2GB large (10 GB uncompressed), parse them, and write them into a date-partitioned table to BigQuery (BQ) via Google Cloud Dataflow (GCDF).
Each file holds 7 days of data, the whole date range is about 2 years (730 days and counting). My current pipeline looks like this:
p.apply("Read logfile", TextIO.Read.from(bucket))
.apply("Repartition", Repartition.of())
.apply("Parse JSON", ParDo.of(new JacksonDeserializer()))
.apply("Extract and attach timestamp", ParDo.of(new ExtractTimestamps()))
.apply("Format output to TableRow", ParDo.of(new TableRowConverter()))
.apply("Window into partitions", Window.into(new TablePartWindowFun()))
.apply("Write to BigQuery", BigQueryIO.Write
.to(new DayPartitionFunc("someproject:somedataset", tableName))
.withSchema(TableRowConverter.getSchema())
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
The Repartition is something I've built in while trying to make the pipeline reshuffle after decompressing, I have tried running the pipeline with and without it. Parsing JSON works via a Jackon ObjectMapper and corresponding classes as suggested here. The TablePartWindowFun is taken from here, it is used to assign a partition to each entry in the PCollection.
The pipeline works for smaller files and not too many, but breaks for my real data set. I've selected large enough machine types and tried setting a maximum number of workers, as well as using autoscaling up to 100 of n1-highmem-16 machines. I've tried streaming and batch mode and disSizeGb values from 250 up to 1200 GB per worker.
The possible solutions I can think of at the moment are:
Uncompress all files on GCS, and so enabling the dynamic work splitting between workers, as it is not possible to leverage GCS's gzip transcoding
Building "many" parallel pipelines in a loop, with each pipeline processsing only a subset of the 90 files.
Option 2 seems to me like programming "around" a framework, is there another solution?
Addendum:
With Repartition after Reading the gzip JSON files in batch mode with 100 workers max (of type n1-highmem-4), the pipeline runs for about an hour with 12 workers and finishes the Reading as well as the first stage of Repartition. Then it scales up to 100 workers and processes the repartitioned PCollection. After it is done the graph looks like this:
Interestingly, when reaching this stage, first it's processing up to 1.5 million element/s, then the progress goes down to 0. The size of OutputCollection of the GroupByKey step in the picture first goes up and then down from about 300 million to 0 (there are about 1.8 billion elements in total). Like it is discarding something. Also, the ExpandIterable and ParDo(Streaming Write) run-time in the end is 0. The picture shows it slightly before running "backwards".
In the logs of the workers I see some exception thrown while executing request messages that are coming from the com.google.api.client.http.HttpTransport logger, but I can't find more info in Stackdriver.
Without Repartition after Reading the pipeline fails using n1-highmem-2 instances with out of memory errors at exactly the same step (everything after GroupByKey) - using bigger instance types leads to exceptions like
java.util.concurrent.ExecutionException: java.io.IOException:
CANCELLED: Received RST_STREAM with error code 8 dataflow-...-harness-5l3s
talking to frontendpipeline-..-harness-pc98:12346
Thanks to Dan from the Google Cloud Dataflow Team and the example he provided here, I was able to solve the issue. The only changes I made:
Looping over the days in 175 = (25 weeks) large chunks, running one pipeline after the other, to not overwhelm the system. In the loop make sure the last files of the previous iteration are re-processed and the startDate is moved forward at the same speed as the underlying data (175 days). As WriteDisposition.WRITE_TRUNCATE is used, incomplete days at the end of the chunks are overwritten with correct complete data this way.
Using the Repartition/Reshuffle transform mentioned above, after reading the gzipped files, to speed up the process and allow smoother autoscaling
Using DateTime instead of Instant types, as my data is not in UTC
UPDATE (Apache Beam 2.0):
With the release of Apache Beam 2.0 the solution became much easier. Sharding BigQuery output tables is now supported out of the box.
It may be worthwhile trying to allocate more resources to your pipeline by setting --numWorkers with a higher value when you run your pipeline. This is one of the possible solutions discussed in the “Troubleshooting Your Pipeline” online document, at the "Common Errors and Courses of Action" sub-chapter.

Resources