How to write a custom ES sink in Flume 1.7 - flume

In the Flume agent I am collecting the elements from Kafka topics and I need to insert them into ES. However, I need to perform a digestion process first in the sink, so I need to write a custom sink to pass the data from the agent's channel to a Java digestion module (which I have already written).
Can anyone share a template of a custom sink that I can use as a reference? Flume's official website doesn't say much about this topic:
A custom sink’s class and its dependencies must be included in the agent’s classpath when starting the Flume agent. The type of the custom sink is its FQCN.
https://flume.apache.org/FlumeUserGuide.html#custom-sink
And once the custom sink is ready, how could I link the following three files to make the agent work:
custom sink
ingestion jar (java module to perform the ingestion process)
FlumeAgent.properties
Thank you for any feedback. I will keep adding information as I progress in this task.

It sounds like you are trying to use Flume to receive events from Kafka (source) and forward them to ES (sink), with some data processing logic you have already written.
With this understanding, I would suggest you look into Flume interceptors, which are responsible for altering/filtering events on the fly before they are sent to the sink.
So all your business logic to alter the events can be implemented as a custom interceptor, and it should be configured on the Flume source.
For reference, you can check out the source code of the native interceptors that are already available. This should give you an idea of the Flume interceptor framework.
Here is the ES Sink source code
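If you still prefer to write the custom sink itself, below is a minimal, untested sketch of what such a sink could look like (the class name DigestingEsSink and the digestAndIndex(...) call are placeholders for your own digestion module; this is not the actual ES sink code). Per the documentation you quoted, you would package it in a jar, put that jar plus your ingestion jar on the agent's classpath, and use the class's FQCN as the sink type in FlumeAgent.properties.
import org.apache.flume.Channel;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.Transaction;
import org.apache.flume.conf.Configurable;
import org.apache.flume.sink.AbstractSink;

public class DigestingEsSink extends AbstractSink implements Configurable {

  @Override
  public void configure(Context context) {
    // Read sink properties from the agent .properties file via context.getString(...)
  }

  @Override
  public void start() {
    // Initialize your digestion module / ES client here
    super.start();
  }

  @Override
  public void stop() {
    // Release resources here
    super.stop();
  }

  @Override
  public Status process() throws EventDeliveryException {
    Channel channel = getChannel();
    Transaction txn = channel.getTransaction();
    txn.begin();
    try {
      Event event = channel.take();
      if (event == null) {
        // Nothing in the channel right now; ask Flume to back off for a while
        txn.commit();
        return Status.BACKOFF;
      }
      // Hand the event body to your digestion module, then index into ES
      // digester.digestAndIndex(event.getBody());   // hypothetical call
      txn.commit();
      return Status.READY;
    } catch (Exception e) {
      txn.rollback();
      throw new EventDeliveryException("Failed to deliver event", e);
    } finally {
      txn.close();
    }
  }
}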
Sample Flume config
a1.sources = kafkaSource
a1.sinks = ES_Sink
a1.channels = channel1
a1.sources.kafkaSource.interceptors = i1
a1.sources.kafkaSource.interceptors.i1.type = org.apache.flume.interceptor.<Custom_Interceptor_name>$Builder
a1.sinks.ES_Sink.channel = channel1
a1.sinks.ES_Sink.type = elasticsearch
a1.sinks.ES_Sink.hostNames = 127.0.0.1:9200
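For reference, a minimal sketch of such a custom interceptor could look like the following (the class name MyDigestInterceptor and the digest(...) call are placeholders for your own logic); its fully-qualified class name plus $Builder is what goes into the i1.type property above:
import java.util.Iterator;
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

public class MyDigestInterceptor implements Interceptor {

  @Override
  public void initialize() {
    // Set up the digestion module here
  }

  @Override
  public Event intercept(Event event) {
    // Alter the event body before it reaches the channel/sink
    // event.setBody(digest(event.getBody()));   // hypothetical call
    return event;   // return null to drop the event
  }

  @Override
  public List<Event> intercept(List<Event> events) {
    Iterator<Event> it = events.iterator();
    while (it.hasNext()) {
      if (intercept(it.next()) == null) {
        it.remove();
      }
    }
    return events;
  }

  @Override
  public void close() {
    // Release resources
  }

  public static class Builder implements Interceptor.Builder {
    @Override
    public Interceptor build() {
      return new MyDigestInterceptor();
    }

    @Override
    public void configure(Context context) {
      // Read interceptor properties from the agent config
    }
  }
}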

Related

How to use Apache beam connector without running inside pipeline

We are running our program inside a Kubernetes pod, which listens for Pub/Sub messages. Based on the message data type it launches a Dataflow job. And once the job execution finishes, we send a Pub/Sub message to another system.
The pipeline is launched in batch mode; it reads from GCS and, after processing, writes back to GCS.
Pipeline pipeline = Pipeline.create(options);
PCollection<String> read = pipeline
    .apply("Read from GCS",
        TextIO.read().from("GCS_PATH").withCompression(Compression.GZIP));
// process
// write to GCS
....
PipelineResult result = pipeline.run();
result.waitUntilFinish();
// send job completed message to Pubsub to other component
....
....
I have to send events to other components in the system. As of now I am using the Pub/Sub Java client library to push messages to Pub/Sub.
Is there a way I can use the Apache Beam Pub/Sub connector to send a message like below?
PubsubIO.writeMessages().to("topicName");
Or what is the right way to do the same?
To solve this use case you can use the Wait API. Details can be found here.
PCollection<Void> firstWriteResults = data.apply(ParDo.of(...write to first database...));
data.apply(Wait.on(firstWriteResults))
    // Windows of this intermediate PCollection will be processed no earlier than when
    // the respective window of firstWriteResults closes.
    .apply(ParDo.of(...write to second database...));
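Adapted to your pipeline, a rough sketch could look like the one below. It assumes TextIO's withOutputFilenames() (available in recent Beam releases) so the GCS write exposes a PCollection to wait on; the bucket name is a placeholder, and the final notification is shown as a plain DoFn where you could call the Pub/Sub client library you already use (PubsubIO is generally a streaming-only sink on the Dataflow runner):
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.Compression;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.WriteFilesResult;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Wait;
import org.apache.beam.sdk.values.PCollection;

Pipeline pipeline = Pipeline.create(options);   // options as in your snippet

PCollection<String> processed = pipeline
    .apply("Read from GCS",
        TextIO.read().from("GCS_PATH").withCompression(Compression.GZIP));
// ... your processing steps ...

// withOutputFilenames() makes the GCS write return a result whose written
// filenames are a PCollection, i.e. something Wait.on() can use as a signal.
WriteFilesResult<Void> written = processed.apply("Write to GCS",
    TextIO.write().to("gs://my-bucket/output").withOutputFilenames());

// A single "done" marker that is held back until the write above has finished.
pipeline
    .apply("Done marker", Create.of("job-completed"))
    .apply(Wait.on(written.getPerDestinationOutputFilenames()))
    .apply("Notify", ParDo.of(new DoFn<String, Void>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        // Runs only after the GCS write has completed; publish the completion
        // message to Pub/Sub here, e.g. with the Pub/Sub client library.
      }
    }));

pipeline.run().waitUntilFinish();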

ROS - How do I publish a message and get the subscribed callback immediately

I have a ROS node that allows you to "publish" a data structure to it, to which it responds by publishing an output. The timestamps of what I published and what it publishes are matched.
Is there a mechanism for a blocking call where I send/publish, and it waits until I receive the output?
I think you need the ROS services (client/server) pattern instead of publisher/subscriber.
Here is a simple example of how to do that in Python:
Client code snippet:
import rospy
from test_service.srv import MySrvFile

rospy.wait_for_service('a_topic')
try:
    send_hi = rospy.ServiceProxy('a_topic', MySrvFile)
    print('Client: Hi, do you hear me?')
    resp = send_hi('Hi, do you hear me?')
    print("Server: {}".format(resp.response))
except rospy.ServiceException as e:
    print("Service call failed: %s" % e)
Server code snippet:
import rospy
from test_service.srv import MySrvFile, MySrvFileResponse
def callback_function(req):
print(req)
return MySrvFileResponse('Hello client, your message received.')
rospy.init_node('server')
rospy.Service('a_topic', MySrvFile, callback_function)
rospy.spin()
MySrvFile.srv
string request
---
string response
Server out:
request: "Hi, do you hear me?"
Client out:
Client: Hi, do you hear me?
Server: Hello client, your message received.
Learn more in ros-wiki
Project repo on GitHub.
[UPDATE]
If you are looking for fast communication, TCPROS-based communication is not the best fit, because it is slower than a broker-less communicator like ZeroMQ (which has low latency and high throughput):
The ROS service pattern's equivalent in ZeroMQ is REQ/REP (client/server)
The ROS publisher/subscriber pattern's equivalent in ZeroMQ is PUB/SUB
ROS publisher/subscriber with wait_for_message is roughly equivalent to ZeroMQ's PUSH/PULL
ZeroMQ is available in both Python and C++
Also, to transfer huge amounts of data (e.g. point clouds), ROS has a mechanism called nodelet, which is supported only in C++. This communication is based on shared memory on a single machine instead of a TCPROS socket.
What exactly is a nodelet?
Since you want to stick with publishers/subscribers, assuming from your comment that services are too slow, I would have a look at waitForMessage (Documentation).
And for an example of how to use it, you can have a look at this ROS Answers question.
All you need to do is publish your data, immediately call waitForMessage on the output topic, and manually pass the received message to your "callback".
I hope this is what you were looking for.
To get this request/reply behaviour, ROS has a mechanism called a ROS service.
You can specify the input and output of your service in a service file, similar to a ROS message definition. You can then call the service of a node with your input, and the call will return an output when the service is finished.
Here is a tutorial on how to use this mechanism in Python. If you prefer C++, there is also one; you should find it easily.

Launching composed task built by DSL from stream application

Every example I've seen (task-launcher sink and triggertask source) shows how to launch the task defined by the uri attribute.
My task definitions look like this:
sampleTask <t2: timestamp || t1: timestamp>
sampleTask-t1 timestamp
sampleTask-t2 timestamp
sampleTaskRunner composed-task-runner --graph=sampleTask
My question is: how do I launch the composed task runner (sampleTaskRunner, defined by DSL) from a stream application?
Thanks
UPDATE
I ended up with the solution below, which triggers the task using the SCDF REST API:
composedTask definition:
<timestamp || mySampleTask>
Stream definition:
http | httpclient | log
Deployment properties:
app.http.port=81
app.httpclient.body=name=composedTask&arguments=--increment-instance-enabled=true
app.httpclient.http-method=POST
app.httpclient.url=http://localhost:9393/tasks/executions
app.httpclient.headers-expression={'Content-Type':'application/x-www-form-urlencoded'}
Though it's easy to implement an http sink component, it would be great if the stream application starters provided one out of the box.
Another concern I have is about discovering the SCDF REST URL when deployed in a distributed environment.
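For reference, the request the httpclient processor issues above is roughly equivalent to the following standalone sketch using Spring's RestTemplate (the SCDF host/port and the task name are taken from the properties above; adjust them to your environment):
import org.springframework.http.HttpEntity;
import org.springframework.http.HttpHeaders;
import org.springframework.http.MediaType;
import org.springframework.util.LinkedMultiValueMap;
import org.springframework.util.MultiValueMap;
import org.springframework.web.client.RestTemplate;

public class LaunchComposedTask {
  public static void main(String[] args) {
    RestTemplate rest = new RestTemplate();

    HttpHeaders headers = new HttpHeaders();
    headers.setContentType(MediaType.APPLICATION_FORM_URLENCODED);

    // Same form fields the httpclient app sends: task name and launch arguments
    MultiValueMap<String, String> form = new LinkedMultiValueMap<>();
    form.add("name", "composedTask");
    form.add("arguments", "--increment-instance-enabled=true");

    // POST /tasks/executions on the SCDF server launches the task by name
    rest.postForEntity("http://localhost:9393/tasks/executions",
        new HttpEntity<>(form, headers), String.class);
  }
}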
Here's a quick take from one of the SCDF's R&D team members (Glenn Renfro).
stream create foozer --definition "trigger --fixed-delay=5 | tasklaunchrequest-transform --uri=maven://org.springframework.cloud.task.app:composedtaskrunner-task:1.1.0.BUILD-SNAPSHOT --command-line-arguments='--graph=sampleTask-t1||sampleTask-t2 --increment-instance-enabled=true --spring.datasource.url=jdbc:mariadb://localhost:3306/test --spring.datasource.username=root --spring.datasource.password=password --spring.datasource.driverClassName=org.mariadb.jdbc.Driver' | task-launcher-local" --deploy
In the foozer stream definition,
1) "trigger" source happens to trigger an upstream event every 5s
2) "tasklaunchrequest-transform" processor takes a few arguments; more specifically, it uses "composedtaskrunner-task:1.1.0.BUILD-SNAPSHOT" to launch a composed-task graph (i.e., sampleTask-t1||sampleTask-t2)
3) Pay attention to --increment-instance-enabled. This was recently added to the CTR application, and it provides the ability to re-launch a composed-task on a recurring cadence
4) Since the CTR and SCDF must share the same database, we are also passing the datasource properties as command-line args (the SCDF-server is already started with the same datasource credentials)
Hope this helps.
Lastly, we will add a sample to the reference guide via: spring-cloud/spring-cloud-dataflow#1780

How to test apache flume load balancing - Sink groups

I am kind of a newbie to Apache Flume. I have manually configured a single-tier agent with a load-balancing sink group. I would like to know how I can test the sink group load balancing. Any ideas, folks?
You can define two different sinks and list them in a sink group as below:
agent1.sinkgroups = g1
agent1.sinkgroups.g1.sinks = HDFS1 HDFS2
agent1.sinkgroups.g1.processor.type = load_balance
agent1.sinkgroups.g1.processor.backoff = true
agent1.sinkgroups.g1.processor.selector = round_robin
Here both of them are HDFS sinks.
You can specify the processor selector (round_robin [default], random, or a custom selector), which defines how the load should be balanced between the two sinks.
When you run the agent, you can see that two different sets of data are stored in the two respective HDFS paths (sinks).
The other two optional parameters are backoff and selector.maxTimeOut.
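For completeness, the two sinks referenced by the group could be defined along these lines (the channel name and the HDFS paths below are placeholders):
agent1.sinks = HDFS1 HDFS2
agent1.sinks.HDFS1.type = hdfs
agent1.sinks.HDFS1.channel = ch1
agent1.sinks.HDFS1.hdfs.path = hdfs://namenode/flume/sink1
agent1.sinks.HDFS2.type = hdfs
agent1.sinks.HDFS2.channel = ch1
agent1.sinks.HDFS2.hdfs.path = hdfs://namenode/flume/sink2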
You can refer to this link for more info: Flume 1.6.0 User Guide

Duplicate channel before being intercepted by interceptor

I'm using flume to do something like this
Source --> interceptor --> Channel --> multiplexing --> HDFS Sink
                                            |-----------> Null Sink
I would like to add a channel just after the source, but I don't want the events to pass through the interceptor; I would like the "raw" events. Like this:
Source --> interceptor (i) --> Channel --> multiplexing --> HDFS Sink
   |                                            |-----------> Null Sink
   |-------> Channel (must not be intercepted by i) --> HDFS
How can I do it ?
Thanks
Since interceptors are configured per source, you will have to add a second source (configured with no interceptors at all and listening on a different HTTP port) and emit your data twice: one copy for the source with interceptors, and one copy for the other source.
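As an illustration, the two-source layout could be configured along these lines (the source type, ports, and channel names are assumptions, since your actual source isn't shown):
a1.sources = src1 src2
a1.channels = ch1 ch2
# src1 keeps the interceptor; its events follow the existing multiplexed flow
a1.sources.src1.type = http
a1.sources.src1.port = 5140
a1.sources.src1.interceptors = i1
a1.sources.src1.channels = ch1
# src2 has no interceptors; its events reach HDFS "raw"
a1.sources.src2.type = http
a1.sources.src2.port = 5141
a1.sources.src2.channels = ch2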
Another possibility is to chain two agents: the first one contains a single source with no interceptors and two sinks, one for persisting the data as-is in HDFS and the other feeding the agent you already have. I mean:
src-->ch-->multip-->sink----------->src-->int-->ch-->multip-->hdfssink
            |-->hdfssink                          |-->nullsink
(_________agent1__________)        (____________agent2_______________)
