Print Data from Confluent Source and Sink connectors - docker

I have source and sink connectors installed using Confluent, and they are working fine. But when I view the Docker logs using
docker logs -f container-name
the output looks something like this:
[2018-09-19 09:44:06,571] INFO WorkerSourceTask{id=mariadb-source-0} Committing offsets (org.apache.kafka.connect.runtime.WorkerSourceTask)
[2018-09-19 09:44:06,571] INFO WorkerSourceTask{id=mariadb-source-0} flushing 0 outstanding messages for offset commit (org.apache.kafka.connect.runtime.WorkerSourceTask)
[2018-09-19 09:44:06,573] INFO WorkerSourceTask{id=mariadb-source-0} Finished commitOffsets successfully in 2 ms (org.apache.kafka.connect.runtime.WorkerSourceTask)
[2018-09-19 09:44:16,194] INFO WorkerSinkTask{id=oracle-sink-0} Committing offsets asynchronously using sequence number 1077: {faheemuserdbtest-0=OffsetAndMetadata{offset=7, metadata=''}} (org.apache.kafka.connect.runtime.WorkerSinkTask)
[2018-09-19 09:44:16,574] INFO WorkerSourceTask{id=mariadb-source-0} Committing offsets (org.apache.kafka.connect.runtime.WorkerSourceTask)
[2018-09-19 09:44:16,574] INFO WorkerSourceTask{id=mariadb-source-0} flushing 0 outstanding messages for offset commit (org.apache.kafka.connect.runtime.WorkerSourceTas
But it doesn't show the actual data going through the topics. Is there a way I can print that data in the logs? I'm asking because I'm moving these logs to a Kibana dashboard.
Yes, I can read the data from the Kafka topic, but that is not my scenario.

Depending on the connector, you can see the messages if you enable TRACE logging in the Connect Log4j properties.
If you are using Confluent's Docker images, the CONNECT_LOG4J_LOGGERS environment variable controls the logger levels.
If you want the actual JDBC data in Elasticsearch, though, you'd typically install the Elasticsearch sink connector rather than parse the data out of those logs.
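As a rough illustration of the environment-variable route, the sketch below assembles a value for CONNECT_LOG4J_LOGGERS in Python. It assumes the variable takes comma-separated logger=LEVEL pairs, which is my understanding of the Confluent images and worth checking against their documentation. The two logger names are taken from the log output in the question; whether TRACE on them prints record contents depends on the connector and version. The helper name is hypothetical.

```python
def log4j_loggers_value(levels):
    """Join {logger name: level} into a comma-separated logger=LEVEL string.

    Assumption: this is the format the Confluent Connect image expects
    for CONNECT_LOG4J_LOGGERS.
    """
    return ",".join(f"{name}={level}" for name, level in levels.items())


# Logger names copied from the log lines in the question.
value = log4j_loggers_value({
    "org.apache.kafka.connect.runtime.WorkerSourceTask": "TRACE",
    "org.apache.kafka.connect.runtime.WorkerSinkTask": "TRACE",
})

# Line that could go into an env file or a compose "environment" section.
print(f"CONNECT_LOG4J_LOGGERS={value}")
```

TRACE is very verbose, so scoping it to specific loggers like this, rather than raising the root level, keeps the volume sent to Kibana manageable.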

No, you can't see that data in the logs.
The connectors don't print the actual data they copy. If you have such a requirement, you would probably have to change the logging in the source and sink connector source code and customize it to your needs.

Related

Understanding filebeat monitoring stats when ingesting netflow traffic

I'm running filebeat 7.14.0 to ingest Netflow data, which is then stored in Elasticsearch and viewed on Kibana. When I run filebeat -e, I will see some logs generated by filebeat every 30s.
I'm trying to understand the stats more. For example, I see
"input":{"netflow":{"flows":1234,"packets":{"dropped":2345,"received":12345}}}}
But each netflow packet contains about 10 netflow records, so when I receive 12345 packets, I would expect 123450 flows, and the stats only show 1234 flows. Does it mean I'm missing a lot of flows?
For a better understanding of the logs, enable logging in debug mode and add
logging.selectors: ["input"]
This will show you stats per second in the logs.
Add grep "stats" when checking the log to find the stats easily.
This will show you flows and packets per second.
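The expectation in the question can be written out as a quick calculation. This is only a sketch of the arithmetic, using the numbers quoted in the question and the asker's estimate of about 10 records per packet; the function name is hypothetical, and the result does not by itself say why the counters differ.

```python
def flow_gap(packets_received, flows_reported, records_per_packet):
    """Return (expected flows, shortfall) under a fixed records-per-packet estimate."""
    expected = packets_received * records_per_packet
    return expected, expected - flows_reported


# Numbers from the stats line in the question; 10 is the asker's estimate.
expected, shortfall = flow_gap(packets_received=12345,
                               flows_reported=1234,
                               records_per_packet=10)
print(expected, shortfall)  # 123450 122216
```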

Running Kafka connect in standalone mode, having issues with offsets

I am using this Github repo and folder path I found: https://github.com/entechlog/kafka-examples/tree/master/kafka-connect-standalone to run Kafka connect locally in standalone mode. I have made some changes to the Docker compose file but mainly changes that pertain to authentication.
The problem I am now having is that when I run the Docker image, I get this error multiple times, for each partition (there are 10 of them, 0 through 9):
[2021-12-07 19:03:04,485] INFO [bq-sink-connector|task-0] [Consumer clientId=connector-consumer-bq-sink-connector-0, groupId=connect-bq-sink-connector] Found no committed offset for partition <topic name here>-<partition number here> (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator:1362)
I don't think there are any issues with authenticating or connecting to the endpoint(s); I think the consumer (the Connect sink) is not sending the offset back.
Am I missing an environment variable? You will see this docker compose file has CONNECT_OFFSET_STORAGE_FILE_FILENAME: /tmp/connect.offsets, and I tried adding CONNECTOR_OFFSET_STORAGE_FILE_FILENAME: /tmp/connect.offsets (CONNECT_ vs. CONNECTOR_) and then I get an error Failed authentication with <Kafka endpoint here>, so now I'm just going in circles.
I think you are focused on the wrong output.
That is an INFO message, not an error.
The offsets file (or topic in distributed mode) is for source connectors.
Sink connectors use consumer groups. If no committed offset is found for groupId=connect-bq-sink-connector, then the consumer group hasn't committed one yet.

Logging Data via Docker Logs - .NET Core 3.1 Filebeat ELK

I am using Docker logs with the standard driver to show dummy data from a .Net Core web app via docker logs [options]
More specifically, I am using ILogger<>
logger.LogDebug("Hello from web app");
The output is:
2020-02-03T15:29:02.378269378Z dbug: WebAppLogger.Startup[0]
2020-02-03T15:29:02.378353836Z Hello from Configure
Eventually I want to use Filebeat in conjunction with ELK to ship these logs, but the examples I have found look more like this:
11/June/2019:00:10:45 + 0000 DEBUG "This is a generic log message example"
Mine seems to be split across two lines with two slightly different timestamps. Is there a more suitable logging library for this, or am I doing something wrong?
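To make the two-line shape concrete, here is a minimal Python sketch that folds a header line and the message line after it into one event. It is not how Filebeat or ILogger handle this; the regular expression is an assumption based only on the two sample lines above, and the function name is hypothetical.

```python
import re

# Header shape assumed from the sample: "<timestamp> <level>: <category>[<event id>]"
HEADER = re.compile(
    r"^(?P<ts>\S+) (?P<level>\w+): (?P<category>[\w.]+)\[(?P<event_id>\d+)\]$"
)


def merge_console_lines(lines):
    """Fold each header line and the message line(s) after it into one dict."""
    events = []
    current = None
    for line in lines:
        match = HEADER.match(line)
        if match:
            current = match.groupdict()
            current["message"] = ""
            events.append(current)
        elif current is not None:
            # Drop the Docker timestamp prefix from the continuation line.
            _, _, text = line.partition(" ")
            current["message"] = (current["message"] + " " + text.strip()).strip()
    return events


# The two lines quoted in the question.
sample = [
    "2020-02-03T15:29:02.378269378Z dbug: WebAppLogger.Startup[0]",
    "2020-02-03T15:29:02.378353836Z Hello from Configure",
]
events = merge_console_lines(sample)
print(events[0]["level"], events[0]["message"])
```

The event keeps the timestamp of the header line, so the slightly later timestamp on the message line is simply discarded.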

HDFS write from kafka : createBlockOutputStream Exception

I'm using Hadoop from docker swarm with 1 namenode and 3 datanodes (on 3 physical machines).
I'm also using Kafka and Kafka Connect with the HDFS connector to write messages into HDFS in Parquet format.
I'm able to write data to HDFS using HDFS clients (hdfs put).
But when Kafka is writing messages, it works at the very beginning, then it fails with this error:
org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/10.0.0.8:50010]
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
at org.apache.hadoop.hdfs.DFSOutputStream.createSocketForPipeline(DFSOutputStream.java:1533)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1309)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1262)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:448)
[2018-05-23 10:30:10,125] INFO Abandoning BP-468254989-172.17.0.2-1527063205150:blk_1073741825_1001 (org.apache.hadoop.hdfs.DFSClient:1265)
[2018-05-23 10:30:10,148] INFO Excluding datanode DatanodeInfoWithStorage[10.0.0.8:50010,DS-cd1c0b17-bebb-4379-a5e8-5de7ff7a7064,DISK] (org.apache.hadoop.hdfs.DFSClient:1269)
[2018-05-23 10:31:10,203] INFO Exception in createBlockOutputStream (org.apache.hadoop.hdfs.DFSClient:1368)
org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/10.0.0.9:50010]
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
at org.apache.hadoop.hdfs.DFSOutputStream.createSocketForPipeline(DFSOutputStream.java:1533)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1309)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1262)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:448)
And then the datanodes are not reachable anymore for the process :
[2018-05-23 10:32:10,316] WARN DataStreamer Exception (org.apache.hadoop.hdfs.DFSClient:557)
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /topics/+tmp/test_hdfs/year=2018/month=05/day=23/hour=08/60e75c4c-9129-454f-aa87-6c3461b54445_tmp.parquet could only be replicated to 0 nodes instead of minReplication (=1). There are 3 datanode(s) running and 3 node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1733)
at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:265)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2496)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:828)
But if I look into the hadoop web admin console, all the nodes seem to be up and OK.
I've checked the hdfs-site and the "dfs.client.use.datanode.hostname" setting is set to true both on namenode and datanodes. All ips in hadoop configuration files are defined using 0.0.0.0 addresses.
I've tried to format the namenode too, but the error happened again.
Could the problem be that Kafka is writing to HDFS too fast and overwhelming it? That would be odd, as I've tried the same configuration on a smaller cluster and it worked well even with a high throughput of Kafka messages.
Do you have any other idea of the origin of this problem?
Thanks
dfs.client.use.datanode.hostname=true also has to be configured on the client side. Judging from your log stack:
java.nio.channels.SocketChannel[connection-pending remote=/10.0.0.9:50010]
I guess 10.0.0.9 is a private network IP; thus, it seems that the property is not set in your client's hdfs-client.xml.
You can find more detail here.
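One way to check the client side is to read the property straight out of the configuration XML. The sketch below is a minimal Python example that assumes the file uses Hadoop's usual configuration/property/name/value layout; the inline XML is a hypothetical stand-in for the client's file, and the helper name is made up for illustration.

```python
import xml.etree.ElementTree as ET


def hadoop_property(xml_text, name):
    """Return the value of a Hadoop <property> by name, or None if absent."""
    root = ET.fromstring(xml_text.strip())
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


# Hypothetical client-side configuration content, for illustration only.
client_conf = """
<configuration>
  <property>
    <name>dfs.client.use.datanode.hostname</name>
    <value>true</value>
  </property>
</configuration>
"""

uses_hostname = hadoop_property(client_conf, "dfs.client.use.datanode.hostname") == "true"
print(uses_hostname)  # True
```

If this returns None for the file the Connect worker actually loads, the client is still dialing the datanode IPs reported by the namenode, which would match the 10.0.0.x addresses in the stack trace.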

Flume agent: How flume agent gets data from a webserver located in different physical server

I am trying to understand Flume and referring to the official page of flume at flume.apache.org
In particular, referring to this section, I am a bit confused.
Do we need to run the Flume agent on the actual webserver, or can we run Flume agents on a different physical server and acquire data from the webserver?
If the above is correct, then how does the Flume agent get the data from the webserver logs? How can the webserver make its data available to the Flume agent?
Can anyone help me understand this?
The Flume agent must pull data from a source, publish to a channel, which then writes to a sink.
You can install Flume agent in either a local or remote configuration. But, keep in mind that having it remote will add some network latency to your event processing, if you are concerned about that. You can also "multiplex" Flume agents to have one remote aggregation agent, then individual local agents on each web server.
Assuming a Flume agent is installed locally with a Spooldir or exec source, it will essentially read files from a local directory or run a command (such as tail) locally. This is how it would get data from logs.
If the Flume agent is set up with a Syslog or TCP source (see the Data ingestion section on network sources), then it can be on a remote machine, and your logging application must open a network socket to publish messages to that server. This is a similar pattern to Apache Kafka.
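The remote pattern can be sketched from the application side with Python's standard logging.handlers.SysLogHandler. This is only an illustration under stated assumptions: a local UDP socket stands in for a remote agent listening with a syslog-style source, and the logger name and message are hypothetical. In a real setup the address would be the agent's host and port.

```python
import logging
import logging.handlers
import socket

# Stand-in for the remote agent: a local UDP socket on an OS-chosen port.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
receiver.settimeout(2.0)
agent_address = receiver.getsockname()

# Application side: a logger that publishes each record over the network.
logger = logging.getLogger("webserver.access")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False
handler = logging.handlers.SysLogHandler(
    address=agent_address, socktype=socket.SOCK_DGRAM
)
logger.addHandler(handler)

logger.info("GET /index.html 200")  # hypothetical access-log line

# What the agent would receive: a syslog-framed datagram with the message.
datagram, _ = receiver.recvfrom(4096)
print(datagram)

logger.removeHandler(handler)
handler.close()
receiver.close()
```

The web server never writes to a file the agent has to reach; it pushes each event to the agent's socket, which is why the agent can live on another machine.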

Resources