Telegraf - send Error/0/false if input plugin (service) is down - influxdb

I need to calculate my service uptime (e.g. Redis, Memcached):
uptime = successful metric-fetch attempts / *total* metric-fetch attempts (one every 10 sec over some period)
Can I somehow configure Telegraf to send 0/false if my input (service) is down?
Currently, if the input service is down, InfluxDB doesn't receive any new metric points from Telegraf (there are only error logs on the Telegraf daemon side).

Daniel Nelson answered here: https://github.com/influxdata/telegraf/issues/4563#issuecomment-413653844
Each plugin can add custom metrics to the internal plugin (example: the http_listener input plugin).
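Until a plugin reports its own failures, the ratio from the question can be approximated by treating every missing point as a failed attempt. The following is a minimal sketch under that assumption; the function name and the sample numbers are hypothetical, and only the 10-second interval comes from the question.

```python
def estimated_uptime(points_received: int, period_s: float, interval_s: float = 10.0) -> float:
    """Estimate uptime as received points / expected collection attempts.

    Assumption: Telegraf writes exactly one point per interval while the
    service is up and nothing at all while it is down.
    """
    if period_s <= 0 or interval_s <= 0:
        raise ValueError("period_s and interval_s must be positive")
    expected_attempts = period_s / interval_s
    # Clamp to 1.0 because timing jitter can yield one extra point per window.
    return min(points_received / expected_attempts, 1.0)


# Hypothetical example: 54 points arrived during a 10-minute window.
ratio = estimated_uptime(points_received=54, period_s=600)
print(f"uptime: {ratio:.1%}")
```

The count of received points would come from a query against InfluxDB over the same period; that query is left out here because it depends on the measurement layout.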

Related

Collect metrics from a Prometheus server to telegraf

I have a prometheus server running on a K8s instance and telegraf on a different cluster. Is there some way to pull metrics from the prometheus server using telegraf? I know there is telegraf support for scraping metrics from prometheus clients but I am looking to get these metrics from the prometheus server.
Thanks
There is a tab inside Data Sources called Scrapers; you just need to enter the URL of the server.
I am trying to configure this using the CLI, but I can only do it with the GUI.
There is a Prometheus remote write parser (https://github.com/influxdata/telegraf/tree/master/plugins/parsers/prometheusremotewrite); I think it will be included in the 1.19.0 release of Telegraf. If you want to try it out now, you can use a nightly build (https://github.com/influxdata/telegraf#nightly-builds).
Point your Prometheus remote write at Telegraf and configure the input plugin to listen for traffic on the port that you configured. For convenience, configure the file output plugin so you can see the metrics in a file almost immediately.

What does it mean when Spark executor results are "sent via BlockManager"?

I have a host running a spark-master along with 3 spark-workers, all in docker containers. I have another host acting as a Spark-driver, reading data from the first host.
I am able to retrieve data from the first host as long as the data returned is tiny (<6000 rows).
But it fails when I try to read large blocks (100k+ rows).
I checked the executor logs. When the reads are successful, I get the following log messages:
19/07/23 21:54:17 INFO CassandraConnector: Connected to Cassandra cluster: DataMonitor
19/07/23 21:54:17 INFO Executor: Finished task 0.0 in stage 1.0 (TID 4). 1014673 bytes result sent to driver
19/07/23 21:54:24 INFO CassandraConnector: Disconnected from Cassandra cluster: DataMonitor
But when the reads are unsuccessful, I get the following log messages:
19/07/23 22:21:55 INFO CassandraConnector: Connected to Cassandra cluster: DataMonitor
19/07/23 22:22:03 INFO MemoryStore: Block taskresult_13 stored as bytes in memory (estimated size 119.2 MB, free 2.4 GB)
19/07/23 22:22:03 INFO Executor: Finished task 0.3 in stage 4.0 (TID 13). 124969484 bytes result sent via BlockManager)
19/07/23 22:22:10 INFO CassandraConnector: Disconnected from Cassandra cluster: DataMonitor
It looks like when the result is large enough, it gets "sent via BlockManager".
But when it's small enough, it gets "sent to driver".
So how do I make every result get sent to the driver?
Each executor runs tasks and sends the result of each task back to the driver.
If a task result is small, the executor sends it directly with the task status. The result is treated as large when either of the following conditions holds:
taskResultSize > conf.getSizeAsBytes("spark.task.maxDirectResultSize", 1L << 20)
or
taskResultSize > conf.get("spark.driver.maxResultSize")
source code
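In that case the executor stores the result in its local BlockManager (in memory, spilling to disk if needed; note the MemoryStore line in the log above) and sends an IndirectTaskResult with the blockId back to the driver.
The driver then uses Netty via the BlockManager to download the remote result.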
Take a look here.
If it is not detailed enough, let me know.
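The two conditions above can be checked against the sizes in the question's logs. This is a simplified sketch of the decision, not Spark's actual code: the function name is hypothetical, the 1 MiB default comes from the quoted condition, and the 1g value for spark.driver.maxResultSize is an assumed setting. It ignores any other limits Spark may apply.

```python
MAX_DIRECT_RESULT_SIZE = 1 << 20  # spark.task.maxDirectResultSize default from the quoted condition
MAX_RESULT_SIZE = 1 << 30         # assumption: spark.driver.maxResultSize set to 1g


def sent_via_block_manager(
    task_result_size: int,
    max_direct: int = MAX_DIRECT_RESULT_SIZE,
    max_result: int = MAX_RESULT_SIZE,
) -> bool:
    """Return True when the result is too large to travel with the task status."""
    return task_result_size > max_direct or task_result_size > max_result


# Sizes taken from the two executor logs in the question.
for size in (1014673, 124969484):
    path = "via BlockManager" if sent_via_block_manager(size) else "to driver"
    print(f"{size} bytes result sent {path}")
```

The successful read (1014673 bytes) falls just under the 1 MiB threshold, while the failing one (124969484 bytes) is far above it, which matches the two log messages.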

Print Data from Confluent Source and Sink connectors

I have source and sink connectors installed using Confluent and they are working fine. But when I view the Docker logs using
docker logs -f container-name
the output is something like this:
[2018-09-19 09:44:06,571] INFO WorkerSourceTask{id=mariadb-source-0} Committing offsets (org.apache.kafka.connect.runtime.WorkerSourceTask)
[2018-09-19 09:44:06,571] INFO WorkerSourceTask{id=mariadb-source-0} flushing 0 outstanding messages for offset commit (org.apache.kafka.connect.runtime.WorkerSourceTask)
[2018-09-19 09:44:06,573] INFO WorkerSourceTask{id=mariadb-source-0} Finished commitOffsets successfully in 2 ms (org.apache.kafka.connect.runtime.WorkerSourceTask)
[2018-09-19 09:44:16,194] INFO WorkerSinkTask{id=oracle-sink-0} Committing offsets asynchronously using sequence number 1077: {faheemuserdbtest-0=OffsetAndMetadata{offset=7, metadata=''}} (org.apache.kafka.connect.runtime.WorkerSinkTask)
[2018-09-19 09:44:16,574] INFO WorkerSourceTask{id=mariadb-source-0} Committing offsets (org.apache.kafka.connect.runtime.WorkerSourceTask)
[2018-09-19 09:44:16,574] INFO WorkerSourceTask{id=mariadb-source-0} flushing 0 outstanding messages for offset commit (org.apache.kafka.connect.runtime.WorkerSourceTas
But it doesn't show the actual data going through the topics. Is there a way I can print that data in the logs? I'm moving these logs to a Kibana dashboard.
Yes, I can read the data from the Kafka topic, but that is not my scenario.
Depending on the connector, you can see the messages if you enable TRACE logging in the Connect Log4j properties.
If you are using Confluent's Docker images, there are environment variables such as CONNECT_LOG4J_LOGGERS for controlling that.
If you want the actual JDBC data in Elasticsearch, though, you'd typically install the Elasticsearch sink rather than parse it out of those logs.
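As a rough sketch of what the CONNECT_LOG4J_LOGGERS value looks like, the helper below builds the comma-separated `logger=LEVEL` string that the variable expects. The helper name is hypothetical, and the logger `io.confluent.connect.jdbc` is an assumption based on the JDBC-style connector names in the question; substitute the package of the connector you actually run.

```python
from typing import Mapping

VALID_LEVELS = {"TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL", "OFF"}


def connect_log4j_loggers(levels: Mapping[str, str]) -> str:
    """Build a comma-separated 'logger=LEVEL' string for CONNECT_LOG4J_LOGGERS."""
    parts = []
    for logger_name, level in levels.items():
        level = level.upper()
        if level not in VALID_LEVELS:
            raise ValueError(f"unknown Log4j level: {level}")
        parts.append(f"{logger_name}={level}")
    return ",".join(parts)


# Assumed logger name; adjust to the connector in use.
value = connect_log4j_loggers({"io.confluent.connect.jdbc": "trace"})
print(f"CONNECT_LOG4J_LOGGERS={value}")
```

The printed line can be passed to the container as an environment variable. TRACE output is very verbose, so it is probably best limited to the one connector package.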
No, you can't see that data in the logs.
The connectors don't print the actual data they copy. If you have such a requirement, you would probably have to change the logging in the source and sink connector source code and customize it to your needs.

HDFS write from kafka : createBlockOutputStream Exception

I'm running Hadoop on Docker Swarm with 1 namenode and 3 datanodes (on 3 physical machines).
I'm also using Kafka and Kafka Connect with the HDFS connector to write messages into HDFS in Parquet format.
I'm able to write data to HDFS using HDFS clients (hdfs put).
But when Kafka is writing messages, it works at the very beginning, then it fails with this error:
org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/10.0.0.8:50010]
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
at org.apache.hadoop.hdfs.DFSOutputStream.createSocketForPipeline(DFSOutputStream.java:1533)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1309)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1262)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:448)
[2018-05-23 10:30:10,125] INFO Abandoning BP-468254989-172.17.0.2-1527063205150:blk_1073741825_1001 (org.apache.hadoop.hdfs.DFSClient:1265)
[2018-05-23 10:30:10,148] INFO Excluding datanode DatanodeInfoWithStorage[10.0.0.8:50010,DS-cd1c0b17-bebb-4379-a5e8-5de7ff7a7064,DISK] (org.apache.hadoop.hdfs.DFSClient:1269)
[2018-05-23 10:31:10,203] INFO Exception in createBlockOutputStream (org.apache.hadoop.hdfs.DFSClient:1368)
org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/10.0.0.9:50010]
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
at org.apache.hadoop.hdfs.DFSOutputStream.createSocketForPipeline(DFSOutputStream.java:1533)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1309)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1262)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:448)
And then the datanodes are not reachable anymore for the process :
[2018-05-23 10:32:10,316] WARN DataStreamer Exception (org.apache.hadoop.hdfs.DFSClient:557)
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /topics/+tmp/test_hdfs/year=2018/month=05/day=23/hour=08/60e75c4c-9129-454f-aa87-6c3461b54445_tmp.parquet could only be replicated to 0 nodes instead of minReplication (=1). There are 3 datanode(s) running and 3 node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1733)
at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:265)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2496)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:828)
But if I look at the Hadoop web admin console, all the nodes seem to be up and OK.
I've checked hdfs-site.xml, and the "dfs.client.use.datanode.hostname" setting is set to true on both the namenode and the datanodes. All IPs in the Hadoop configuration files are defined using 0.0.0.0 addresses.
I've tried formatting the namenode too, but the error happened again.
Could the problem be that Kafka is writing to HDFS too fast and overwhelming it? That would be odd, as I've tried the same configuration on a smaller cluster and it worked well even with a high throughput of Kafka messages.
Do you have any other idea about the origin of this problem?
Thanks
dfs.client.use.datanode.hostname=true also has to be configured on the client side. Judging from your stack trace:
java.nio.channels.SocketChannel[connection-pending remote=/10.0.0.9:50010]
I guess 10.0.0.9 is a private network IP; thus, it seems that the property is not set in your client within hdfs-client.xml.
You can find more detail here.
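One way to confirm the client-side setting is to parse the configuration file the Kafka Connect worker actually loads and look for the property. The sketch below assumes the usual Hadoop `<configuration>/<property>` XML layout and uses an inline sample string instead of a real file; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

PROPERTY_NAME = "dfs.client.use.datanode.hostname"

# Inline sample standing in for the client's Hadoop configuration file.
SAMPLE_CLIENT_CONFIG = """
<configuration>
  <property>
    <name>dfs.client.use.datanode.hostname</name>
    <value>true</value>
  </property>
</configuration>
"""


def uses_datanode_hostname(config_xml: str) -> bool:
    """Return True if the config sets dfs.client.use.datanode.hostname to true."""
    root = ET.fromstring(config_xml.strip())
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == PROPERTY_NAME:
            return prop.findtext("value", default="").strip().lower() == "true"
    return False  # property absent: the client connects by IP


print(uses_datanode_hostname(SAMPLE_CLIENT_CONFIG))
```

To check a real deployment, read the file from the Hadoop configuration directory used by the connector and pass its contents to the function.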

Flume agent: How flume agent gets data from a webserver located in different physical server

I am trying to understand Flume and am referring to the official Flume page at flume.apache.org.
In particular, I am a bit confused by this section.
Do we need to run the Flume agent on the actual web server, or can we run Flume agents on a different physical server and acquire data from the web server?
If the latter is possible, how does the Flume agent get the data from the web server logs? How can the web server make its data available to the Flume agent?
Can anyone help me understand this?
A Flume agent pulls data from a source and publishes it to a channel, which then writes to a sink.
You can install the Flume agent in either a local or a remote configuration. Keep in mind, though, that a remote agent adds some network latency to your event processing, if that is a concern. You can also "multiplex" Flume agents to have one remote aggregation agent plus individual local agents on each web server.
Assuming a Flume agent is installed locally with a Spooldir or exec source, it'll essentially tail a file or run a command locally. This is how it would get data from logs.
If the Flume agent is set up with a Syslog or TCP source (see Data ingestion section on network sources), then it can be on a remote machine, and you must establish a network socket in your logging application to publish messages to the other server. This is a similar pattern to Apache Kafka.
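The remote pattern can be illustrated from the application side with Python's standard `SysLogHandler`, which sends each log record over a network socket. In this sketch a local UDP socket stands in for the remote Flume syslog source so the example is self-contained; the logger name and the log line are hypothetical.

```python
import logging
import logging.handlers
import socket


def make_remote_logger(host: str, port: int) -> logging.Logger:
    """Create a logger that ships every record to a syslog listener over UDP."""
    logger = logging.getLogger("webserver.access")
    logger.setLevel(logging.INFO)
    handler = logging.handlers.SysLogHandler(
        address=(host, port), socktype=socket.SOCK_DGRAM
    )
    logger.addHandler(handler)
    logger.propagate = False  # keep the record out of the root logger
    return logger


# Stand-in for the remote Flume agent: a UDP socket on an ephemeral local port.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
receiver.settimeout(2.0)
host, port = receiver.getsockname()

logger = make_remote_logger(host, port)
logger.info("GET /index.html 200")

received, _ = receiver.recvfrom(4096)
receiver.close()
print(received)
```

In a real setup the host and port would be those of the machine running the Flume agent, and the web server would need no Flume installation of its own.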

Resources