How to test apache flume load balancing - Sink groups - flume

I am kind of newbie to apache flume , I have configured single tier agent with sink group -load balance manually , I would like to know how can i test the sink group load balancing ? any idea folks

You can define two different sinks and mention them in the Sink Groups as below,
agent1.sinkgroups = g1
agent1.sinkgroups.g1.sinks = HDFS1 HDFS2
agent1.sinkgroups.g1.processor.type = load_balance
agent1.sinkgroups.g1.processor.backoff = true
agent1.sinkgroups.g1.processor.selector = round_robin
Here both of them are HDFS sinks.
You can mention the process selector (round_robin[default], random or custom selector) which defines how should the load be balanced between two sinks.
When you run the agent, you can see that two different set of data is stored in two respective HDFS paths(sinks).
Other two optional parameters are backoff and selector.maxTimeOut
You can refer this link for more info Flume 1.6.0 User Guide

Related

TFF: Remote Executor

We are setting up a federated scenario with Server and Client on different physical machines.
On the server, we have used the docker container to kickstart:
The above has been borrowed from Kubernetes tutorial. We believe this creates a 'local executor' [Ref 1] which helps create a gRPC server [Ref 2].
Ref 1:
Ref 2:
Next on the client 1, we are calling tff.framework.RemoteExecutor that connects to the gRPC server.
Our understanding based on the above is that the Remote Executor runs on the client which connects to the gRPC server.
Assuming the above is correct, how can we send a
tff.tf_computation
from the server to the client and print the output on the client side to ensure the whole setup works well.
Your understanding is definitely correct.
If you construct an ExecutorFactory directly, as seems to be the case in the code above, passing it to tff.framework.set_default_context will install your remote stack as the default mechanism for executing computations in the TFF runtime. You should additionally be able to pass the appropriate channels to tff.backends.native.set_remote_execution_context to handle the remote executor construction and context installation if desired, but the way you are doing it certainly works, and allows for greater customization.
Once you have set this up, running an example end-to-end should be fairly simple. We will set up a computation which takes a set of federated integers, prints on the clients, and sums the integers up. Let:
#tff.tf_computation(tf.int32)
def print_and_return(x):
# We must use tf.print here, as this logic will be
# serialized and run on the clients as TensorFlow.
tf.print('hello world')
return x
#tff.federated_computation(tff.FederatedType(tf.int32, tff.CLIENTS))
def print_and_sum(federated_arg):
same_ints = tff.federated_map(print_and_return, federated_arg)
return tff.federated_sum(same_ints)
Suppose we have N clients; we simply instantiate the set of federated integers, and invoke our computation.
federated_ints = [1] * N
total = print_and_sum(federated_ints)
assert total == N
This should cause the tf.prints defined above to run on the remote machine; as long as tf.print is directed to an output stream which you can monitor, you should be able to see it.
PS: you may note that the federated sum above is unnecessary; it certainly is. The same effect can be had by simply mapping the identity function with the serialized print.

Launching composed task built by DSL from stream application

Every example I've seen (task-launcher sink and triggertask source ) shows how to launch the task defined by uri attribute.
My tasks definitions look like this :
sampleTask <t2: timestamp || t1: timestamp>
sampleTask-t1 timestamp
sampleTask-t2 timestamp
sampleTaskRunner composed-task-runner --graph=sampleTask
My question is how do I launch the composed task runner (sampleTaskRunner, defined by DSL) from stream application.
Thanks
UPDATE
I ended up with the below solution that triggers task using SCDF REST API :
composedTask definition :
<timestamp || mySampleTask>
Stream definition :
http | httpclient | log
Deployment properties :
app.http.port=81
app.httpclient.body=name=composedTask&arguments=--increment-instance-enabled=true
app.httpclient.http-method=POST
app.httpclient.url=http://localhost:9393/tasks/executions
app.httpclient.headers-expression={'Content-Type':'application/x-www-form-urlencoded'}
Though it's easy to implement http sink component, would be great if stream application starters will provide one out of the box.
Another concern I have is about discovering the SCDF REST URL when deployed in distributed environment.
Here's a quick take from one of the SCDF's R&D team members (Glenn Renfro).
stream create foozer --definition "trigger --fixed-delay=5 | tasklaunchrequest-transform --uri=maven://org.springframework.cloud.task.app:composedtaskrunner-task:1.1.0.BUILD-SNAPSHOT --command-line-arguments='--graph=sampleTask-t1||sampleTask-t2 --increment-instance-enabled=true --spring.datasource.url=jdbc:mariadb://localhost:3306/test --spring.datasource.username=root --spring.datasource.password=password --spring.datasource.driverClassName=org.mariadb.jdbc.Driver' | task-launcher-local" --deploy
In the foozer stream definition,
1) "trigger" source happens to trigger an upstream event every 5s
2) "tasklaunchrequest-transform" processor takes a few arguments; more specifically, it uses "composedtaskrunner-task:1.1.0.BUILD-SNAPSHOT" to launch a composed-task graph (i.e., sampleTask-t1||sampleTask-t2)
3) Pay attention to --increment-instance-enabled. This was recently added to CTR application and this provides the ability to re-launch a composed-task in a recurring cadence
4) Since the CTR and SCDF must share the same database, we are also passing datasource properties as command-line args. (SCDF-server is already started with the same datasource credentials)
Hope this helps.
Lastly, we will add a sample to the reference guide via: spring-cloud/spring-cloud-dataflow#1780

How to write a custom ES sink in Flume 1.7

In the Flume agent I am collection the elements from Kafka topics and I need to insert them in ES. However I need to perform a previous digestion process in the sink, so I need to write a custom sink to pass the data from the Agent's channel to a java digestion module (which I have written already).
Can anyone share with me a template of a custom sink and can use as a reference? Flumes official website doesn't say much about this topic:
A custom sink’s class and its dependencies must be included in the agent’s classpath when starting the Flume agent. The type of the custom sink is its FQCN.
https://flume.apache.org/FlumeUserGuide.html#custom-sink
And once the custom sink is ready, How could I link the following three files to make the agent work:
custom sink
ingestion jar (java module to perform the ingestion process)
FlumeAgent.properties
Thank you for any feedback. I will keep adding information as soon as I progress in this task.
Hope you are trying to use Flume to recieve events from Kafka (source) and forwarding it to ES (sink) with some data processing logic already you have.
With this understanding, I would suggest you to look into Flume interceptors which is responsible for altering/filtering the events on fly before sending to Sink.
So all your business logic to alter the events can be implemented as a custom interceptor and it should be configured to the Flume channel.
For reference you can checkout the native interceptors source code already available. This should probably give you an idea on the Flume interceptor framework.
Here is the ES Sink source code
Sample Flume config
a1.sources = kafkaSource
a1.sinks = ES_Sink
a1.channels = channel1
a1.sources.kafkaSource.interceptors = i1
a1.sources.kafkaSource.interceptors.i1.type = org.apache.flume.interceptor.<Custom_Interceptor_name>$Builder
a1.sinks.ES_Sink.channel = channel1
a1.sinks.ES_Sink.type = elasticsearch
a1.sinks.ES_Sink.hostNames = 127.0.0.1:9200

How to count total number of rows in a file using google dataflow

I would like to know if there is a way to find out total no rows in a file using google dataflow. Any code sample and pointer will be great help. Basically, I have a method as
int getCount(String fileName) {}
So, above method will return total count of rows and its implementation will be dataflow code.
Thanks
Seems like your use case is one that doesn't require distributed processing, because the file is compressed and hence can not be read in parallel. However, you may still find it useful to use Dataflow APIs for the sake of their ease of access to GCS and automatic decompression.
Since you also want to get the result out of your pipeline as an actual Java object, you need to use the Direct runner, which runs in-process, without talking to the Dataflow service or doing any distributed processing, however in return it provides the ability to extract PCollection's into Java objects:
Something like this:
PipelineOptions options = ...;
DirectPipelineRunner runner = DirectPipelineRunner.fromOptions(options);
Pipeline p = Pipeline.create(options);
PCollection<Long> countPC =
p.apply(TextIO.Read.from("gs://..."))
.apply(Count.<String>globally());
DirectPipelineRunner.EvaluationResults results = runner.run(p);
long count = results.getPCollection(countPC).get(0);

What causes flume with GCS sink to throw a OutOfMemoryException

I am using flume to write to Google Cloud Storage. Flume listens on HTTP:9000. It took me some time to make it work (add gcs libaries, use a credentials file...) but now it seems to communicate over the network.
I am sending very small HTTP request for my tests, and I have plenty of RAM available:
curl -X POST -d '[{ "headers" : { timestamp=1417444588182, env=dev, tenant=myTenant, type=myType }, "body" : "some body ONE" }]' localhost:9000
I encounter this memory exception on first request (then of course, it stops working):
2014-11-28 16:59:47,748 (hdfs-hdfs_sink-call-runner-0) [INFO - com.google.cloud.hadoop.util.LogUtil.info(LogUtil.java:142)] GHFS version: 1.3.0-hadoop2
2014-11-28 16:59:50,014 (SinkRunner-PollingRunner-DefaultSinkProcessor) [ERROR - org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:467)] process failed
java.lang.OutOfMemoryError: Java heap space
at java.io.BufferedOutputStream.<init>(BufferedOutputStream.java:76)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.<init>(GoogleHadoopOutputStream.java:79)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:820)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
(see complete stack trace as a gist for full details)
The strange part is that folders and files are created the way I want, but files are empty.
gs://my_bucket/dev/myTenant/myType/2014-12-01/14-36-28.1417445234193.json.tmp
Is it something wrong with the way I configured flume + GCS or is it a bug in GCS.jar ?
Where should I check to gather more data ?
ps : I am running flume-ng inside docker.
My flume.conf file:
# Name the components on this agent
a1.sources = http
a1.sinks = hdfs_sink
a1.channels = mem
# Describe/configure the source
a1.sources.http.type = org.apache.flume.source.http.HTTPSource
a1.sources.http.port = 9000
# Describe the sink
a1.sinks.hdfs_sink.type = hdfs
a1.sinks.hdfs_sink.hdfs.path = gs://my_bucket/%{env}/%{tenant}/%{type}/%Y-%m-%d
a1.sinks.hdfs_sink.hdfs.filePrefix = %H-%M-%S
a1.sinks.hdfs_sink.hdfs.fileSuffix = .json
a1.sinks.hdfs_sink.hdfs.round = true
a1.sinks.hdfs_sink.hdfs.roundValue = 10
a1.sinks.hdfs_sink.hdfs.roundUnit = minute
# Use a channel which buffers events in memory
a1.channels.mem.type = memory
a1.channels.mem.capacity = 10000
a1.channels.mem.transactionCapacity = 1000
# Bind the source and sink to the channel
a1.sources.http.channels = mem
a1.sinks.hdfs_sink.channel = mem
related question in my flume/gcs journey: What is the minimal setup needed to write to HDFS/GS on Google Cloud Storage with flume?
When uploading files, the GCS Hadoop FileSystem implementation sets aside a fairly large (64MB) write buffer per FSDataOutputStream (file open for write). This can be changed by setting "fs.gs.io.buffersize.write" to a smaller value, in bytes, in core-site.xml. I imagine 1MB would suffice for low-volume log collection.
In addition, check what the maximum heap size is set to when launching the JVM for flume. The flume-ng script sets a default JAVA_OPTS value of -Xmx20m to limit the heap to 20MB. This can be set to a larger value in flume-env.sh (see conf/flume-env.sh.template in the flume tarball distribution for details).

Resources