Kafka source and HDFS sink in Spring Cloud Data Flow - spring-cloud-dataflow

I am using Kafka as the source and I want to write the messages from Kafka to HDFS using the HDFS sink. I can see the file getting created on HDFS, but the messages from Kafka are never written to it. Please find the stream DSL below.
stream create --definition ":streaming > hdfs --spring.hadoop.fsUri=hdfs://127.0.0.1:50071 --hdfs.directory=/ws/output --hdfs.file-name=kafkastream --hdfs.file-extension=txt --spring.cloud.stream.bindings.input.consumer.headerMode=raw" --name mykafkastream
Please help me resolve this.

It could be that the data isn't written to HDFS yet. You can force a flush/sync while you are testing. Try setting --hdfs.enable-sync=true --hdfs.flush-timeout=10000; that way the data is written to HDFS every 10 s whether the buffer is full or not.
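For example, the same definition from the question could be redeployed with those two options appended (everything else below is copied from the original definition; the 10 s flush interval is only meant for testing):
stream create --definition ":streaming > hdfs --spring.hadoop.fsUri=hdfs://127.0.0.1:50071 --hdfs.directory=/ws/output --hdfs.file-name=kafkastream --hdfs.file-extension=txt --spring.cloud.stream.bindings.input.consumer.headerMode=raw --hdfs.enable-sync=true --hdfs.flush-timeout=10000" --name mykafkastream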

Related

How to Read Data from S3 into Kafka (Docker image)

I have installed Kafka on Docker on Windows, and it is running as shown in the screenshot below. I have also installed the "Amazon S3 Source Connector" by following this link:
https://docs.confluent.io/kafka-connect-s3-source/current/index.html#quick-start.
My questions are:
i) How do I execute a command to see all Kafka topics (not through a GUI)?
ii) How do I check if the S3 source connector is installed properly (I mean its location, or via the CLI)?
iii) The link also specifies to create and use "quickstart-s3source.properties". Where do I do this? On my desktop, or inside Docker?
Regarding your question title, that source connector does not read arbitrary S3 data, only data that was written by the S3 sink connector.
execute a command to see all Kafka topics
kafka-topics --list
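Since the broker runs in Docker, a typical way to run that is to exec into the broker container (the container name and the bootstrap address below are assumptions; adjust them to your setup):
docker exec -it broker kafka-topics --bootstrap-server localhost:9092 --list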
check if S3 source connector is installed
You're only running the datagen connector, which doesn't include the S3 source, but you'd use the /connector-plugins endpoint of the Connect server.
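For example, assuming the Connect REST API is exposed on localhost:8083 (the default port; adjust host/port to however your container maps it), you can list the installed plugins with:
curl http://localhost:8083/connector-plugins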
specifies to create and use "quickstart-s3source.properties"
You need to convert this to JSON since the Docker image runs a distributed Connect server, and those quickstart files are meant to be used in standalone mode. It doesn't matter where you create the file as long as it's given to the Connect server via HTTP POST.
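A hedged sketch of what that POST could look like: the key/value pairs inside "config" are the same ones from quickstart-s3source.properties, just wrapped in the Connect REST JSON envelope (the connector name, bucket, and region below are placeholders; carry over every key from your actual properties file):
curl -X POST -H "Content-Type: application/json" http://localhost:8083/connectors -d '{
  "name": "quickstart-s3source",
  "config": {
    "connector.class": "io.confluent.connect.s3.source.S3SourceConnector",
    "tasks.max": "1",
    "s3.bucket.name": "my-bucket",
    "s3.region": "us-west-2"
  }
}'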

Print Data from Confluent Source and Sink connectors

I have source and sink connectors installed using Confluent and they are working fine. But when I look at the Docker logs using
docker logs -f container-name
the output is something like this:
[2018-09-19 09:44:06,571] INFO WorkerSourceTask{id=mariadb-source-0} Committing offsets (org.apache.kafka.connect.runtime.WorkerSourceTask)
[2018-09-19 09:44:06,571] INFO WorkerSourceTask{id=mariadb-source-0} flushing 0 outstanding messages for offset commit (org.apache.kafka.connect.runtime.WorkerSourceTask)
[2018-09-19 09:44:06,573] INFO WorkerSourceTask{id=mariadb-source-0} Finished commitOffsets successfully in 2 ms (org.apache.kafka.connect.runtime.WorkerSourceTask)
[2018-09-19 09:44:16,194] INFO WorkerSinkTask{id=oracle-sink-0} Committing offsets asynchronously using sequence number 1077: {faheemuserdbtest-0=OffsetAndMetadata{offset=7, metadata=''}} (org.apache.kafka.connect.runtime.WorkerSinkTask)
[2018-09-19 09:44:16,574] INFO WorkerSourceTask{id=mariadb-source-0} Committing offsets (org.apache.kafka.connect.runtime.WorkerSourceTask)
[2018-09-19 09:44:16,574] INFO WorkerSourceTask{id=mariadb-source-0} flushing 0 outstanding messages for offset commit (org.apache.kafka.connect.runtime.WorkerSourceTask)
But it doesn't show the actual data going through the topics. Is there a way I can print that data in the logs? I ask because I'm shipping these logs to a Kibana dashboard.
Yes, I can read the data from the Kafka topic, but that is not my scenario.
Depending on the connector, if you enable TRACE logging in the Connect worker's Log4j properties, you can see the messages.
If you are using Confluent's Docker images, there is a CONNECT_LOG4J_LOGGERS environment variable for controlling that.
If you want the actual JDBC data in Elasticsearch, though, you'd typically install the Elasticsearch sink connector rather than parse it out of those logs.
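As a hedged example, in a docker-compose file for the Connect container that variable could be set like this (which loggers actually print record contents at TRACE depends on the connector; the two below are simply the worker task loggers visible in your log excerpt):
environment:
  CONNECT_LOG4J_LOGGERS: "org.apache.kafka.connect.runtime.WorkerSourceTask=TRACE,org.apache.kafka.connect.runtime.WorkerSinkTask=TRACE"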
No, you can't see that data in the logs.
The connectors don't print the actual data being copied around. If you have such a requirement, you would probably have to change the logging in the source and sink connector source code and customize it to your needs.

HDFS write from Kafka: createBlockOutputStream Exception

I'm using Hadoop from a Docker swarm with 1 namenode and 3 datanodes (on 3 physical machines).
I'm also using Kafka and Kafka Connect + the HDFS connector to write messages into HDFS in Parquet format.
I'm able to write data to HDFS using HDFS clients (hdfs put).
But when Kafka is writing messages, it works at the very beginning, then it fails with this error:
org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/10.0.0.8:50010]
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
at org.apache.hadoop.hdfs.DFSOutputStream.createSocketForPipeline(DFSOutputStream.java:1533)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1309)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1262)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:448)
[2018-05-23 10:30:10,125] INFO Abandoning BP-468254989-172.17.0.2-1527063205150:blk_1073741825_1001 (org.apache.hadoop.hdfs.DFSClient:1265)
[2018-05-23 10:30:10,148] INFO Excluding datanode DatanodeInfoWithStorage[10.0.0.8:50010,DS-cd1c0b17-bebb-4379-a5e8-5de7ff7a7064,DISK] (org.apache.hadoop.hdfs.DFSClient:1269)
[2018-05-23 10:31:10,203] INFO Exception in createBlockOutputStream (org.apache.hadoop.hdfs.DFSClient:1368)
org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/10.0.0.9:50010]
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
at org.apache.hadoop.hdfs.DFSOutputStream.createSocketForPipeline(DFSOutputStream.java:1533)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1309)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1262)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:448)
And then the datanodes are not reachable anymore for the process:
[2018-05-23 10:32:10,316] WARN DataStreamer Exception (org.apache.hadoop.hdfs.DFSClient:557)
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /topics/+tmp/test_hdfs/year=2018/month=05/day=23/hour=08/60e75c4c-9129-454f-aa87-6c3461b54445_tmp.parquet could only be replicated to 0 nodes instead of minReplication (=1). There are 3 datanode(s) running and 3 node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1733)
at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:265)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2496)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:828)
But if I look in the Hadoop web admin console, all the nodes seem to be up and OK.
I've checked hdfs-site.xml, and the "dfs.client.use.datanode.hostname" setting is set to true on both the namenode and the datanodes. All IPs in the Hadoop configuration files are defined as 0.0.0.0 addresses.
I've tried formatting the namenode too, but the error happened again.
Could the problem be that Kafka is writing too fast into HDFS and overwhelms it? That would be odd, as I've tried the same configuration on a smaller cluster and it worked well even with a high throughput of Kafka messages.
Do you have any other ideas about the origin of this problem?
Thanks
dfs.client.use.datanode.hostname=true has to be configured on the client side as well. Following your stack trace:
java.nio.channels.SocketChannel[connection-pending remote=/10.0.0.9:50010]
I guess 10.0.0.9 refers to a private network IP; thus, it seems that the property is not set in your client within hdfs-client.xml.
You can find more detail here.
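For reference, a minimal sketch of that client-side property in standard Hadoop XML form (where the file lives depends on how your Kafka Connect worker picks up its Hadoop configuration):
<property>
  <name>dfs.client.use.datanode.hostname</name>
  <value>true</value>
</property>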

Flume agent: How a Flume agent gets data from a web server located on a different physical server

I am trying to understand Flume and am referring to its official page at flume.apache.org.
In particular, referring to this section, I am a bit confused.
Do we need to run the Flume agent on the actual web server, or can we run Flume agents on a different physical server and acquire data from the web server?
If the latter is possible, how does the Flume agent get the data from the web server logs? How can the web server make its data available to the Flume agent?
Can anyone help me understand this?
The Flume agent must pull data from a source, publish to a channel, which then writes to a sink.
You can install Flume agent in either a local or remote configuration. But, keep in mind that having it remote will add some network latency to your event processing, if you are concerned about that. You can also "multiplex" Flume agents to have one remote aggregation agent, then individual local agents on each web server.
Assuming a flume agent is locally installed using a Spooldir or exec source, it'll essentially tail any file or run that command locally. This is how it would get data from logs.
If the Flume agent is set up with a Syslog or TCP source (see the Data ingestion section on network sources), then it can be on a remote machine, but you must establish a network socket in your logging application to publish messages to the other server. This is a similar pattern to Apache Kafka.
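As a minimal sketch of the local case (the agent name, log path, and logger sink below are placeholders for illustration; a real deployment would typically use an HDFS or Avro sink instead):
# flume-local.conf: one agent tailing a web server log on the same host
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/httpd/access_log
a1.sources.r1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
You would then start it with something like: flume-ng agent --name a1 --conf-file flume-local.conf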

Flume... How to configure source and sink in different VMs?

I wish to configure the Avro source in VM1 and the sink in VM2.
The source should be able to connect to the channel in VM2, and the channel will send data to the sink in VM2. How should I proceed to configure this?
Please help.
