How to get offsets from Kafka partitions using erlang brod

Basically, the question is:
is there a way to get the committed offsets and Kafka's offsets for each partition when using a brod group subscriber?
I'm using brod v3.16.2
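A minimal sketch of how this could be probed, assuming brod 3.x's get_partitions_count/2, resolve_offset/4 and fetch_committed_offsets/2 behave as documented (the module and function names below are made up for illustration; exact arities and return shapes should be checked against v3.16.2):

%% Sketch only: queries the brod client API directly, independent of the subscriber.
-module(offset_probe).
-export([offsets/4]).

%% Hosts   :: [{"localhost", 9092}]  bootstrap endpoints
%% Client  :: atom(), an already-started brod client
%% GroupId :: binary(), the group subscriber's group id
%% Topic   :: binary()
offsets(Hosts, Client, GroupId, Topic) ->
    {ok, Partitions} = brod:get_partitions_count(Client, Topic),
    %% Latest (log-end) offset per partition, straight from Kafka.
    Latest = [begin
                  {ok, Offset} = brod:resolve_offset(Hosts, Topic, P, latest),
                  {P, Offset}
              end || P <- lists:seq(0, Partitions - 1)],
    %% Offsets committed by the consumer group (kpro structs, per topic).
    {ok, Committed} = brod:fetch_committed_offsets(Client, GroupId),
    {Latest, Committed}.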

Related

Duplicate entries in prometheus

I'm using the prometheus plugin for Jenkins in order to pass data to the prometheus server and subsequently have it displayed in grafana.
With the default setup I can see the metrics at http://:8080/prometheus
But in the list I also find some duplicate entries for the same job
default_jenkins_builds_duration_milliseconds_summary_sum{jenkins_job="spring_api/com.xxxxxx.yyy:yyy-web",repo="NA",} 217191.0
default_jenkins_builds_duration_milliseconds_summary_sum{jenkins_job="spring_api",repo="NA",} 526098.0
Both entries refer to the same Jenkins job spring_api, but the metrics have different values. Why do I see two entries for the same metric?
Possibly one is a subset of the other.
In the Kubernetes world you will have the resource consumption for each container in a pod, and the pod's overall resource usage.
Suppose I query the metric "container_cpu_usage_seconds_total" for {pod="X"}.
Pod X has 2 containers, so I'll get back four series:
{pod="X",container="container1"}
{pod="X",container="container2"}
{pod="X",container="POD"} <- some weird "pause" image with very low usage
{pod="X"} <- sum of container1 and container2
There might also be a discrepancy where the series with no container label is greater than the sum of the per-container consumption. That might be some "not accounted for" overhead, like maybe pod DNS lookups or something; I'm not sure.
I guess my point is that Prometheus will often use combinations of labels, and omissions of labels, to show how a metric is broken down.
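As an illustration only (a hypothetical PromQL sketch reusing the metric and label values from the example above, not something from the original answer), the per-container series can be summed and compared against the label-less total:

# sum over the real containers only (excluding the "POD" pause container)
sum by (pod) (container_cpu_usage_seconds_total{pod="X", container!="", container!="POD"})
# the pod-level series, where the container label is omitted
container_cpu_usage_seconds_total{pod="X", container=""}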

Can Telegraf combine/add values of metrics that are per-node, say for a cluster?

Let's say I have some software running on a VM that emits two metrics, which are fed through Telegraf to be written into InfluxDB. Let's say the metrics are the number of successfully handled HTTP requests (S) and the number of failed HTTP requests (F) on that VM. However, I might configure three such VMs, each emitting those 2 metrics.
Now, I would like a computed metric that is the sum of S from each VM, and the sum of F from each VM, stored as new metrics at various instants of time. Is this something that can be achieved using Telegraf? Or is there a better, more efficient, more elegant way?
Kindly note that my knowledge of Telegraf and InfluxDB is theoretical, as I've only recently started reading up on them, so I have not actually tried any of the above yet.
This isn't something Telegraf would be responsible for.
With Influx 1.x, you'd use a TICKscript or Continuous Queries to calculate the sum and inject the new sampled values.
Roughly, this would look like:
CREATE CONTINUOUS QUERY "sum_sample_daily" ON "database"
BEGIN
  SELECT sum(*) INTO "daily_measurement" FROM "measurement" GROUP BY time(1d)
END
See the InfluxDB Continuous Query (CQ) docs for details.

Get total network traffic between all nodes in a cluster

I'm working in a Docker overlay network with six nodes. I would like to measure the total network traffic between all nodes. I came across iftop, but it only counts the bytes between the local machine and each node, like:
node0(local)<->node1
node0(local)<->node2
...
but not:
node1<->node2
...
I had to install iftop on each node, and even then I had to exclude the following connection because it was already counted above:
node1(local)<->node0
...
Or I had to sum up all the total TX or RX values on each node. Additionally, I had to start iftop on each node at the same time and pause it when I see that my specified process has finished. Is there an easier way, so that I can simply start a recording on any host and stop it to get the total bytes for that period?

How do I get the number of running instances per docker swarm service as a prometheus metric?

For me it seems impossible to get a reliable metric containing all services and their container states (and count).
Using the "last seen" from cAdvisor does not work - it is unreliable; there are some open bugs... Using the Docker metrics I only get the total number of instances running, stopped, ...
Does anyone have an idea?
Maybe the query below can help:
count(count(container_tasks_state{container_label_com_docker_swarm_service_name=~".+", container_label_com_docker_swarm_node_id=~"$node_id"}) by (container_label_com_docker_swarm_service_name))
Use the above query in Grafana, with Prometheus as the data source.

Setting frequency for Camus jobs

I have just started with Camus. I am planning to run the Camus job every hour. We get ~80,000,000 messages (with ~4 KB average size) every hour.
How do I set the following properties:
# max historical time that will be pulled from each partition based on event timestamp
kafka.max.pull.hrs=1
# events with a timestamp older than this will be discarded.
kafka.max.historical.days=3
I am not able to make out these configurations clearly. Should I set the days property to 1 and the hours property to 2?
How does Camus pull the data? I also often see the following error:
ERROR kafka.CamusJob: Offset range from kafka metadata is outside the previously persisted offset
Please check whether kafka cluster configuration is correct. You can also specify config parameter: kafka.move.to.earliest.offset to start processing from earliest kafka metadata offset.
How do I set the configurations correctly to run every hour and avoid that error?
"Offset range from kafka metadata is outside the previously persisted offset ."
Indicates that your fetching is not as fast as the kafka's pruning.
kafka's pruning is defined by log.retention.hours.
1st option :Increase the retention time by changing "log.retention.hours"
2nd Option :Run it with higher frequency.
3rd Option :Set in your camus job kafka.move.to.earliest.offset=true.
This property will force camus to start consuming from the earliest offset currently present in the kafka. But this may lead to data loss since we are not accounting for the pruned data which we were not able to fetch.
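Putting that together for an hourly job, a sketch of the relevant properties (the names are the ones quoted in the question; the values are only suggestions and depend on your retention settings) might look like:

# pull at most 2 hours back per run, so an hourly job can catch up after a missed run
kafka.max.pull.hrs=2
# discard events whose timestamp is older than 1 day
kafka.max.historical.days=1
# fall back to the earliest available offset if the persisted one was pruned
# (accepts possible data loss, as described above)
kafka.move.to.earliest.offset=true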
